Chapter 1
Introduction 


Broadcast  protocols  are  useful  tools  for  distributed  and  fault-tolerant  programming. 
However,  a  wide  range  of  such  protocols  has  been  proposed,  differing  in  their  fault 
tolerance  and  delivery  ordering  characteristics.  This  thesis  describes  techniques  for 
choosing the type of broadcast that will maximize the performance of an application
without  compromising  its  correctness. 

This  work  was  motivated  by  the  ISIS  system,  a  toolkit  for  building  fault-tolerant 
distributed  applications.  All  tools  provided  by  ISIS  are  based  on  a  set  of  broadcast 
communication  primitives.  These  primitives  as  well  as  the  tools  built  from  them 
are  made  available  to  the  application  programmer.  One  objective  of  this  thesis  is 
to  gain  an  understanding  of  the  theoretical  foundations  of  the  ISIS  system,  and 
thereby  help  the  programmer  in  selecting  and  using  the  tools  provided  by  ISIS.  The 
interested  reader  is  referred  to  [BJ87a,BJS88]  for  a  description  of  ISIS. 

Figure 1.1: Distributed system


1.1  Distributed  Systems 

A distributed system consists of a set of independent processors, p1, ..., pn, connected
by a communication network (see Figure 1.1). In such a system, processors
exchange  information  only  by  sending  messages.  There  are  several  parameters  that 
determine  the  characteristics  of  the  system: 

•  Network  topology:  Some  pairs  of  processors  can  communicate  directly. 
Messages  between  other  pairs  of  processors  have  to  be  routed  through  one 
or  more  intermediate  processors.  A  network  in  which  every  pair  of  processors 
can  communicate  directly  is  called  completely  connected. 

• Message ordering: Some communication protocols do not make any guarantees
about the order in which messages are delivered. Other protocols provide
FIFO ordering, i.e., messages are delivered in the order they were sent.

• Message reliability: Message channels may be reliable (all messages sent
are delivered correctly), subject to omission failures (messages may be lost),
or subject to Byzantine failures (messages may be lost or corrupted). Furthermore,
there may or may not be an upper bound on the time between the
sending of a message and its delivery.

• Processor reliability: A processor may fail in several ways. It may stop without
taking incorrect actions (fail-stop) [SS83], fail to send or receive some messages
(omission fault) [Had84], or behave arbitrarily (Byzantine fault) [LSP82,
SD83].

In this dissertation we assume a completely connected network with reliable message
delivery. This decision is justifiable on practical grounds: data-link protocols and
network routing protocols satisfying these assumptions are well understood [Tan81].
In general, we do not assume an upper bound on message delays, nor do we assume
that processors have synchronized clocks. A system with these characteristics
(unbounded message delays, no synchronized clocks) is called asynchronous.

Processors may experience failures, but we restrict ourselves to non-Byzantine
failure  modes.  Processor  omission  faults  can  be  treated  in  the  same  way  as  the 
loss  of  messages  in  the  communication  network.  Therefore,  we  consider  only  crash 
failures  (fail-stop  processors). 


1.2  Objectives 


Many applications running in a distributed system require processors to share
information. Often it is also desirable to replicate information at different sites to
avoid data loss should a failure occur. Useful tools for sharing information and for
maintaining replicated data are reliable broadcast protocols. Such protocols
propagate information from one processor to a set of destination processors in such a
way  that  all  operational  destinations  receive  this  information  despite  failures  in  the 
system.  This  property  is  called  reliable  message  delivery.  In  addition  to  this,  a 
broadcast  protocol  may  also  provide  a  form  of  message  ordering.  The  strongest 
form  is  atomic  ordering.  An  atomic  broadcast  protocol  guarantees  that  all  messages 
are  received  in  the  same  order  everywhere.  An  example  of  a  weaker  form  of  ordering 
is FIFO. A FIFO broadcast guarantees that two messages sent by the same processor
are  received  everywhere  in  the  order  they  were  sent.  Messages  sent  by  different 
processors,  however,  may  be  received  in  different  orders  at  different  sites. 

There  is  a  tradeoff  between  how  much  ordering  a  protocol  provides  and  how  much 
synchronization  delay  is  necessary  to  implement  this  ordering.  A  FIFO  broadcast, 
for  example,  can  be  implemented  efficiently  on  top  of  unordered  message  channels 
by  adding  a  sequence  number  to  every  message.  An  atomic  broadcast,  on  the  other 
hand,  is  much  more  costly  to  implement  in  the  systems  we  study.  It  requires  two 
or  more  phases  of  message  exchanges  between  processors  before  a  message  can 
be  delivered.  It  is,  therefore,  desirable  to  employ  protocols  that  support  only  a 
low  degree  of  ordering  whenever  possible.  This  dissertation  presents  techniques 
for  deciding  how  strongly  ordered  a  protocol  has  to  be  in  order  to  solve  a  given 
application  problem. 


1.3 Outline

This  dissertation  consists  of  seven  chapters.  Chapter  2  describes  several  different 
forms of broadcast protocols known in the literature and discusses their benefits
and  costs. 

Chapter  3  introduces  a  formalism  for  specifying  an  application  problem  in  a 
distributed  system,  and  presents  a  model  for  broadcast-based  implementations  that 
solve  such  problems. 

Chapter 4 investigates conditions under which a specification has an asynchronous
implementation. It is shown that if such an implementation exists, it
can  be  expressed  in  a  canonical  form. 

Chapter 5 proves that in general the existence of an asynchronous implementation
for a given problem is undecidable. However, we identify a subclass
of specifications that captures a broad range of practical problems. The defining
characteristic of specifications in this class is that they have certain commutativity
properties. We describe methods for finding asynchronous implementations for
specifications  in  this  class. 

Chapter  6  examines  how  processor  failures  can  be  integrated  into  our  model  and 
shows  how  this  affects  the  results  of  Chapter  4  and  Chapter  5. 

Chapter  7  summarizes  our  results  and  discusses  future  extensions  of  our  work. 

Throughout  the  thesis  we  use  an  example  that  is  first  introduced  in  Chapter  3. 
Appendix  A  contains  a  comprehensive  presentation  of  this  example  in  which  we 
collect  the  different  elements  addressed  throughout  the  thesis  into  one  discussion. 


Chapter  2 

Reliable  Broadcast  Protocols 


Because this dissertation is about selecting among different forms of broadcast
protocols, we devote this chapter to giving an overview of several variants of broadcast
protocol.1 Such protocols have two distinct properties:

• Reliability: The protocol ensures that a message that is broadcast will eventually
reach all its destinations, even if failures occur while the protocol is
running. 

•  Ordering:  Some  protocols  make  guarantees  about  the  order  in  which  different 
broadcast  messages  are  received  at  different  destination  sites. 

We will describe how these features can be implemented on top of a network that
provides only point-to-point communication between processors. In our discussion
we will first concentrate on the ordering aspect. The following section will present
different ordering properties and describe how such properties can be implemented
in a completely reliable system in which processors do not fail. In Section 2.2 we
will examine how the different types of protocols can be made fault-tolerant.

1 The term “broadcast” is often used to mean that a message is sent to all processors
in the system. We will use it in the more general sense of sending a message
to some subset of all processors. This is often called a multicast.

2.1  Broadcast  Ordering 

2.1.1  Unordered  Broadcast 

The  simplest  way  of  broadcasting  a  message  is  to  just  send  a  copy  of  that 
message  to  every  destination  processor  individually.  This  form  of  broadcast  does 
not  provide  any  form  of  ordering.  Figure  2.1  illustrates  this.  It  shows  a  system  with 
four  processors.  Time  proceeds  from  left  to  right,  and  the  diagonal  lines  represent 
messages. The figure shows p1 broadcasting two messages (a and b) to p2, p3, and
p4. The two messages arrive in the same order at p2 and p3, but p4 receives them
in a different order. Because this form of broadcast does not guarantee any specific
order of delivery, we call it an unordered broadcast, or simply BCAST.

2.1.2 FIFO Broadcast

If  the  underlying  communication  network  provides  FIFO  message  channels,  then 
the  protocol  just  described  will  satisfy  a  stronger  ordering  property:  All  messages 
broadcast  by  the  same  processor  will  be  delivered  in  the  same  order  everywhere, 
namely the order they are sent. Even if the network does not provide FIFO message
channels, it is not difficult to implement FIFO ordering by adding a sequence
number to every message [Tan81]. We call this a FIFO broadcast, or FBCAST for
short. In Figure 2.2, for example, processor p1 broadcasts two messages, first a
then b. Every destination receives these messages in the order they were sent.
Broadcasts sent by different processors, however,
may be delivered in different orders at different sites, as shown in the example.
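
The sequence-number technique mentioned above can be sketched as follows. This is only an illustration under stated assumptions, not the implementation given in [Tan81]: the class name `FifoReceiver`, its method names, and the convention that each sender numbers its broadcasts 0, 1, 2, ... are all hypothetical. The receiver holds back any message that arrives ahead of its turn and delivers it once the gap has been filled.

```python
from collections import defaultdict


class FifoReceiver:
    """Hypothetical receiver that restores per-sender FIFO order."""

    def __init__(self):
        self.next_seq = defaultdict(int)   # sender -> next expected sequence number
        self.pending = defaultdict(dict)   # sender -> {seq: payload} held back
        self.delivered = []                # payloads in delivery order

    def receive(self, sender, seq, payload):
        """Called when the unordered network hands over one message."""
        if seq < self.next_seq[sender] or seq in self.pending[sender]:
            return  # duplicate, ignore
        self.pending[sender][seq] = payload
        # Deliver every message from this sender that is now in sequence.
        while self.next_seq[sender] in self.pending[sender]:
            current = self.next_seq[sender]
            self.delivered.append(self.pending[sender].pop(current))
            self.next_seq[sender] = current + 1
```

Messages from different senders are tracked independently, so this sketch imposes no order between them, which matches the FBCAST guarantee described above.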


Figure  2.1:  Unordered  broadcast 




Figure  2.2:  FIFO  broadcast 


2.1.3 Atomic Broadcast

Broadcasts are often used for updating information that is replicated at several
sites.  FIFO  ordering  may  not  be  enough  if  different  processors  broadcast  update 
messages  independently.  In  this  situation,  two  update  messages  could  be  delivered 
in  different  orders  at  different  sites,  leading  to  inconsistencies.  This  can  be  avoided 
by  using  a  stronger  protocol  that  would  guarantee  that  all  messages  are  delivered 
in  the  same  order  everywhere,  even  if  they  were  sent  independently  by  different 
processors.  Such  a  protocol  is  called  an  atomic  broadcast  protocol,  or  ABCAST 
for  short.  Figure  2.3  illustrates  the  behavior  of  an  ABCAST.  It  shows  two  messages 
broadcast independently by p1 and p2. Both messages are received in the same
order at p3 and p4 (first b, then a in this example).

Figure 2.3: Atomic broadcast

Figure 2.4: Two-phase implementation of atomic broadcast

There are several well-known techniques for implementing ABCAST in asynchronous
systems. Figure 2.4 illustrates a protocol due to Chang and Maxemchuk
[CM84] in which every message is broadcast in two phases. A processor wishing to
broadcast a message sends this message to one distinguished processor, say p1 (first
phase). Processor p1 then forwards the message to its destinations by means of an
FBCAST (second phase). This way all broadcast messages are delivered in the order
they were received and forwarded by p1. A different, more symmetric atomic broadcast
protocol due to Skeen is described in [BJ87b]. This method uses a three-phase
protocol as illustrated in Figure 2.5. Every processor maintains a message delivery
queue; when a broadcast message is received (phase one of the protocol), it is added
to the queue, but not yet delivered to the application program running at that site.
The recipient assigns a temporary “priority number” to the message and returns
this number to the sender of the broadcast (phase two). The recipient chooses this
number to be larger than any number assigned to messages currently queued or
previously delivered. The sender collects all priority numbers, computes their
maximum, and sends this number to all destination sites (phase three). Every recipient
replaces the temporary priority number by the number just received, and reorders
the queue accordingly. A message is delivered when it has received its final priority
number and no messages with a smaller priority number are in the queue.
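
The three-phase priority scheme described above can be sketched in a few lines. This is a minimal in-process simulation, not the protocol as specified in [BJ87b]: the names `Site`, `propose`, `finalize`, and `abcast` are hypothetical, message passing is replaced by direct method calls, and failures are ignored. The text does not say how equal final priority numbers are ordered; the sketch assumes ties are broken by message ID and that a temporary number sorts before an equal final number, so that an unfinished message blocks delivery.

```python
class Site:
    """Hypothetical destination site with a priority-ordered delivery queue."""

    def __init__(self):
        self.counter = 0     # largest priority number seen or assigned so far
        self.queue = {}      # msg_id -> ((number, is_final, msg_id), payload)
        self.delivered = []  # payloads in delivery order

    def propose(self, msg_id, payload):
        """Phases one and two: queue the message, return a temporary number."""
        self.counter += 1
        self.queue[msg_id] = ((self.counter, False, msg_id), payload)
        return self.counter

    def finalize(self, msg_id, final_number):
        """Phase three: adopt the final number, then deliver what is ready."""
        self.counter = max(self.counter, final_number)
        _, payload = self.queue[msg_id]
        self.queue[msg_id] = ((final_number, True, msg_id), payload)
        self._deliver_ready()

    def _deliver_ready(self):
        while self.queue:
            head = min(self.queue, key=lambda m: self.queue[m][0])
            (_, is_final, _), payload = self.queue[head]
            if not is_final:
                break  # the smallest entry is still waiting for its final number
            del self.queue[head]
            self.delivered.append(payload)


def abcast(msg_id, payload, sites):
    """Sender side: collect temporary numbers, then distribute their maximum."""
    proposals = [site.propose(msg_id, payload) for site in sites]
    final_number = max(proposals)
    for site in sites:
        site.finalize(msg_id, final_number)
```

Calling `propose` and `finalize` directly, rather than through `abcast`, makes it possible to interleave two concurrent broadcasts differently at each site and check that both sites still deliver in the same order.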

Atomic ordering makes the design of fault-tolerant distributed applications much
easier,  because  it  reduces  the  uncertainty  caused  by  message  delays  and  failures  in 
the  system.  However,  this  benefit  does  not  come  without  cost.  The  two  protocols 
described  above  need  two  or  three  phases  of  communication  before  an  ABCAST 
message  can  be  delivered.  It  is  not  difficult  to  prove  that  in  an  asynchronous 
system  (i.e.,  a  system  with  unbounded  message  delays),  any  protocol  that  guarantees 
atomic  ordering  requires  some  messages  to  take  at  least  two  hops  before  they  are 
delivered. Consider for example a system with two processors, p1 and p2. Processor
p1 broadcasts a message a; at the same time p2 broadcasts b. Both messages are
addressed to both processors. We claim that either message a needs at least two
hops (to p2 and back to p1) before it can be delivered at p1, or message b needs two
hops. Assume the protocol delivers a at p1 in one hop. This means that p1 sends a
to p2, but it delivers the message locally without waiting for a reply from p2 (see
Figure 2.6). At the time of this local delivery, p1 may not yet know that p2 has sent
a broadcast. If the message b from p2 to p1 is delayed long enough, the protocol
will deliver a before b at p1. Similarly, if b is also delivered in one hop, it is possible
that at p2, b will be delivered before a. But that would violate atomic ordering.

Figure 2.5: Three-phase implementation of atomic broadcast

Figure 2.6: Processor p1 delivers message a locally without waiting for any messages
from p2.

The situation is different if there is a known upper bound on message delays. In
such a system it is possible to maintain synchronized clocks [BD87,LMS85,ST87]. In
this case, atomic ordering can be achieved by a method based on timestamps. The
sender of a broadcast adds a timestamp to the message that shows the value of its
local  clock  when  the  message  is  sent  out.  Messages  received  at  a  destination  site  are 
delivered  to  the  application  program  in  timestamp  order;  however,  before  a  message 
is  delivered,  the  processor  has  to  wait  until  it  is  certain  that  no  more  messages  with 
a  lower  timestamp  will  arrive.  The  amount  of  time  to  wait  depends  on  the  worst 
case  message  delay  and  on  how  closely  clocks  are  synchronized  [CASD84].  The 
disadvantage  of  this  approach  is  that  the  delivery  of  every  message  is  delayed  by 
the  worst  case  message  delay,  which  is  often  much  larger  than  the  average  delay  in 
a  two  or  three  phase  protocol. 
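
The timestamp-based delivery rule can be illustrated with a short sketch. It is not the protocol of [CASD84]; the class `TimestampOrderer` and the parameters `max_delay` (the assumed worst-case message delay) and `max_skew` (the assumed bound on the difference between any two clocks) are hypothetical. The sketch also assumes that ties between equal timestamps are broken by sender name. A message with timestamp T is held until the local clock reaches T + max_delay + max_skew, because after that point no message with a lower timestamp can still be in transit.

```python
import heapq


class TimestampOrderer:
    """Hypothetical receiver that delivers messages in timestamp order."""

    def __init__(self, max_delay, max_skew):
        self.wait = max_delay + max_skew  # how long each message is held back
        self.heap = []                    # entries: (timestamp, sender, payload)
        self.delivered = []               # payloads in delivery order

    def receive(self, timestamp, sender, payload):
        """Buffer an incoming message; nothing is delivered yet."""
        heapq.heappush(self.heap, (timestamp, sender, payload))

    def tick(self, local_clock):
        """Deliver every buffered message whose waiting period has expired."""
        while self.heap and self.heap[0][0] + self.wait <= local_clock:
            _, _, payload = heapq.heappop(self.heap)
            self.delivered.append(payload)
```

Every message waits the full `max_delay + max_skew` regardless of how quickly it actually arrived, which is the disadvantage noted above.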

2.1.4  Causal  Broadcast 

Because  of  the  inherent  cost  of  atomic  broadcast  protocols  it  is  natural  to  look  for 
protocols  that  provide  stronger  ordering  than  FBCAST  but  are  less  expensive  than 
ABCAST.  The  causal  broadcast  (CBCAST  for  short)  is  such  a  protocol.  It  is  based 
on the idea of potential causality introduced by Lamport in [Lam78].

The flow of information during the execution of a distributed system can be
used to define a partial order on events occurring in the system. Such events are
the sending of a message, the receipt of a message, or a local event that only affects
a single processor. Figure 2.7 illustrates this. Events such as e1 and e4 are send
events, events such as e5, e7, and e8 are receive events, and events such as e10 are
local events. According to Lamport's definition, all events that are connected by a path in
this diagram are potentially causally related. Such a path must follow the horizontal
lines (from left to right) or message arrows. For example, e10 is potentially causally
related to e1, because there is a path from e1 to e10 that follows message arrows and
horizontal lines (dotted line in the figure). This dependency is denoted by the symbol
“→”, i.e., e1 → e10. Events that are not connected by such a path are called concurrent.
This is denoted by the symbol “//”. For example, e5 // e14. The relation “→” is called
potential causality or information flow relation. The name “potential causality” has
the following explanation. In physics, the Principle of Causality says that a cause
has to precede its effect. Similarly, an event a in a distributed system can affect an
event b at some other processor only if there is a flow of information from a to b,
i.e., if a precedes b under “→”.

Figure 2.7: Potential causality

Figure 2.8: Causal broadcast

The ordering properties of a causal broadcast protocol are defined in terms of
this information flow relation. CBCAST guarantees that every processor receives
messages in an order that is consistent with “→”. That is, whenever two CBCAST
send events are related by “→”, the protocol ensures that the two messages are
received in the same order everywhere, namely the one given by “→”. For example,
in Figure 2.8, broadcasts B1 and B3 are potentially causally related (B1 → B3,
represented by the dotted line in Figure 2.8). Consequently the message a is received
before c at every site that receives both. Broadcasts B3 and B4, on the other hand,
are concurrent (B3 // B4). Hence the two messages c and d may be received in
different orders at different sites, as shown in this example. Notice that two events
at the same processor are never concurrent. Therefore a causal broadcast also
respects FIFO ordering. For example, in Figure 2.8, B1 → B4; hence message a is
received everywhere before d.

There are several ways of implementing causal broadcast that are very similar
to the use of sequence numbers in FBCAST protocols. A processor wishing to broadcast
a message adds some additional dependency information to the message before
sending it to its destinations. This technique is called “piggybacking”. The information
that is added to an outgoing message m consists of a list of other, previously
received messages that precede m under “→”. This form of CBCAST protocol is
described in detail by Birman and Joseph in [BJ87b]. In a system in which no failures
occur, it is sufficient to transmit only message IDs, instead of piggybacking whole
messages onto other messages [Pet87]. Using this piggybacking technique, causal
ordering can be achieved without multiple phases of message exchanges.
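
The failure-free variant that piggybacks only message IDs can be sketched as follows. This is an illustration under stated assumptions rather than the protocol of [BJ87b] or [Pet87]: the class `CausalReceiver` and its method names are hypothetical, and the sketch assumes that every message named in a dependency list is also addressed to the receiving site, since otherwise the dependent message would wait forever. A message is held back until all messages it depends on have been delivered locally.

```python
class CausalReceiver:
    """Hypothetical receiver that delivers messages in causal order."""

    def __init__(self):
        self.delivered_ids = set()  # IDs of messages already delivered here
        self.delivered = []         # payloads in delivery order
        self.waiting = []           # (msg_id, deps, payload) held back

    def receive(self, msg_id, deps, payload):
        """deps: the piggybacked IDs of messages that causally precede msg_id."""
        self.waiting.append((msg_id, frozenset(deps), payload))
        self._drain()

    def _drain(self):
        progress = True
        while progress:
            progress = False
            for item in list(self.waiting):
                msg_id, deps, payload = item
                if deps <= self.delivered_ids:  # all predecessors delivered
                    self.waiting.remove(item)
                    self.delivered_ids.add(msg_id)
                    self.delivered.append(payload)
                    progress = True
```

On the sending side, the dependency list would contain the IDs of the messages the sender has itself sent or delivered before the new broadcast. Concurrent messages have no dependency on each other, so they are delivered in arrival order and may differ between sites.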

2.2  Reliability 

The  broadcast  protocols  as  we  described  them  in  the  previous  section  only  work 
correctly  if  no  failures  occur.  Consider  for  example  the  BCAST  protocol.  If  the 
sender  crashes  in  the  middle  of  the  protocol,  the  message  will  reach  only  a  subset 
of  the  destination  sites.  The  situation  is  even  worse  for  the  three-phase  ABCAST 
protocol.  The  failure  of  a  single  destination  site  can  cause  the  protocol  to  block, 
preventing  all  other  broadcasts  from  being  received. 

So-called  reliable  broadcast  protocols  avoid  this  undesirable  behavior.  A  reliable 
broadcast  guarantees  that  every  message  sent  will  eventually  be  received  by  all 
operational  destination  sites,  despite  processor  failures.  We  have  to  qualify  this 
statement  a  little.  Under  certain  failure  patterns,  no  protocol  can  guarantee  the 
delivery  of  a  broadcast  to  all  operational  destinations.  For  example,  the  sender 
could crash before it actually sent out any messages. Even if the sender managed
to  communicate  with  some  other  processor  before  it  crashed,  this  other  processor 
could  experience  a  failure  before  talking  to  anybody  else.  In  general,  a  set  of  failures 
in an early stage of a broadcast protocol could wipe out all knowledge about the
message  to  be  sent.  What  we  mean  by  reliable  message  delivery  is  that  a  message  is 
delivered  to  all  operational  destinations  unless  the  sender  fails  before  the  protocol 
has  terminated.  Furthermore,  in  case  the  sender  fails  at  some  time  during  the 
protocol, message delivery must be all-or-nothing. More precisely:

If processor p sends a message m to a set D of destination sites, then
the system will eventually reach one of the following two states:

1. For all q ∈ D: q has received m or q has crashed.

2. Processor p has crashed, and for all q ∈ D: q has crashed or q will
never receive m.

This  property  is  also  called  atomic  message  delivery. 

We  will  now  look  at  the  different  types  of  broadcast  protocols  introduced  in  the 
previous  section  (BCAST,  FBCAST,  CBCAST,  ABCAST)  and  examine  how  they  can  be 
made  reliable. 

2.2.1 Reliable BCAST, FBCAST, and CBCAST

The  simplest  reliable  broadcast  protocol  uses  a  method  called  flooding  or  message 
diffusion.  A  processor  wishing  to  broadcast  a  message  sends  it  to  all  destination 
sites  by  means  of  an  (unreliable)  BCAST.  Every  processor  that  receives  the  message 
forwards  it  to  all  other  destination  sites  using  BCAST.  This  way  every  destination 
will  receive  multiple  copies  of  the  message  (one  from  the  sender  and  one  from  each 
other  destination,  if  no  failures  occur);  it  forwards  the  message  only  the  first  time  it 
is  received  and  ignores  all  duplicates.  This  protocol  achieves  atomic  delivery:  Every 
processor  that  receives  the  message  will  eventually  either  succeed  in  forwarding  it 
to  all  other  operational  destinations  or  it  will  fail.  Therefore,  eventually  either 
all  operational  sites  have  received  the  message  or  all  sites  that  ever  received  the 
message  have  crashed. 
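
The diffusion protocol described above can be sketched as follows. The class `DiffusionNode` and the `send` callback are hypothetical; `send` stands in for an unreliable point-to-point primitive that the surrounding system would have to provide. Each node delivers and forwards a message the first time it sees it and ignores all later copies.

```python
class DiffusionNode:
    """Hypothetical destination site taking part in message diffusion."""

    def __init__(self, name):
        self.name = name
        self.seen = set()     # IDs of messages already received
        self.delivered = []   # payloads in delivery order

    def on_receive(self, msg_id, payload, destinations, send):
        """Handle one copy of a message; forward it only on first receipt."""
        if msg_id in self.seen:
            return  # duplicate, ignore
        self.seen.add(msg_id)
        self.delivered.append(payload)
        for target in destinations:
            if target != self.name:
                send(target, msg_id, payload, destinations)
```

Even if the original sender crashes after reaching a single destination, that destination forwards the message to all others, so every operational destination eventually receives it. The price is that each broadcast generates a number of messages quadratic in the number of destinations.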

By  appending  a  set  of  sequence  numbers  to  every  message,  FIFO  ordering  can 
be  added  to  a  diffusion  protocol.  This  way  we  get  a  reliable  FBCAST. 

The same technique can be used to make CBCAST reliable. The CBCAST protocols
we described in Section 2.1.4 work by piggybacking dependency information onto
the broadcast message to be sent out. The use of message diffusion to propagate this
message ensures that the original message contents as well as the dependency
information are delivered to all operational destination sites. This way causal ordering
can be preserved despite processor failures. For details see [BJ87b].

2.2.2 Reliable ABCAST

ABCAST is a form of consensus protocol, because atomic ordering requires all
processors to agree on a total order on all broadcasts. In [FLP85] Fischer, Lynch, and
Paterson show that it is impossible to achieve consensus in an asynchronous
system if failures occur. Consequently, it is not possible to implement reliable atomic
broadcast in such a system. The reason for this is that if no upper bound on
message delays is known, a processor failure is indistinguishable from very slow
communication. For example, consider a system with two processors p1 and p2. It
is not difficult to prove the impossibility result of [FLP85] for this example: Assume
processor p1 broadcasts a message a with destinations p1, p2. At the same time p2
sends a message b, also addressed to both processors. Consider the following three
scenarios:

1. Processor p2 crashes before sending any messages; p1 does not fail. Then the
message a must eventually (say after some time interval d1) be delivered at
p1.

2. Processor p1 crashes before sending any messages; p2 does not fail. Then the
message b must eventually (say after some time interval d2) be delivered at
p2.

3. Neither of the two processors fails, but the communication network is very
slow; every message takes at least time d = max(d1, d2) before it is received.

Up to time d, processor p1 cannot distinguish Scenario 3 from Scenario 1. In both
cases it has not yet received any messages from p2, but it does not know if p2 has
crashed or is still alive. Therefore, in Scenario 3, p1 will deliver message a after
time d1, before receiving any messages from p2. Similarly, b will be delivered at p2
before p2 receives any messages from p1. But then atomic ordering is violated in
this scenario.

Therefore reliable atomic broadcast can only be achieved if we relax the assumptions
about the asynchrony of the system. There are two ways of doing this:

1. Assume that failures can be detected. The ABCAST protocols described in
[CM84] and [BJ87b] achieve reliability under this assumption. If a processor
participating in an ABCAST protocol experiences a failure, some other
processor can take over and complete the protocol on behalf of the crashed
processor.

2.  Assume  there  is  an  upper  bound  on  message  delays.  In  this  case  a  reliable 
atomic  broadcast  can  be  implemented  by  combining  a  diffusion  protocol  with 
the  method  of  timestamps  to  achieve  atomic  ordering  [CASD84].  However, 
the  amount  of  time  that  a  processor  has  to  wait  before  a  message  can  be 
delivered  to  the  application  program  increases  with  the  number  of  expected 
failures. 

Notice  that  the  second  assumption  implies  the  first.  If  message  delays  are  bounded, 
failures  can  be  detected  by  timeouts.  In  fact,  most  failure  detection  mechanisms  in 
distributed  systems  rely  on  timeouts. 

2.3 Summary

We examined a variety of reliable broadcast protocols that differ in the form of
message  ordering  they  provide. 

• Atomic Broadcast (ABCAST):

All  messages  are  delivered  in  the  same  order  everywhere. 

•  Causal  Broadcast  (CBCAST): 

The  order  in  which  messages  are  delivered  is  consistent  with  the  information 
flow  relation  between  broadcast  events. 

• FIFO Broadcast (FBCAST):

Broadcasts  by  the  same  processor  are  delivered  in  the  order  sent. 

• Unordered Broadcast (BCAST):

Messages  are  delivered  in  an  arbitrary  order. 

The stronger the ordering property of the broadcast, the more costly it is to
implement.  An  atomic  broadcast  protocol  requires  at  least  two  phases  of  message 
exchange,  whereas  CBCAST,  FBCAST,  and  BCAST  can  be  implemented  as  one-phase 
protocols.  Furthermore,  in  an  unreliable  system  in  which  processors  may  experience 
failures,  ABCAST  can  only  be  implemented  if  failures  are  detectable  or  if  an  upper 
bound  on  message  delays  is  known. 


Chapter  3 
Formal  Model 

In this chapter we present a formalism based on events and histories for specifying
problems in a distributed system. We introduce a model for a broadcast-based
distributed implementation and give a definition for the correctness of an
implementation with respect to a problem specification. We illustrate our formalism by
showing that every formal problem has an implementation based on atomic
broadcasts.

3.1  Formal  Problem  Specifications 

A  program  running  in  a  distributed  system  consists  of  several  components,  each 
running  at  a  different  site,  and  interacting  with  each  other  by  sending  and  receiving 
messages.  A  formal  specification  for  such  a  program  can  be  given  in  terms  of 
its input/output behavior. At each site there are clients (human users or other
programs)  that  interact  with  the  distributed  program.  This  interaction  is  typically 
described  by  a  procedural  interface.  A  client  invokes  an  operation  by  passing  the 
operation  name  and  a  set  of  parameters  to  the  component  of  the  distributed  program 
residing at the local site. The program executes the operation, informs the client of
its  completion,  and  possibly  returns  a  value  to  the  client.  During  the  execution  of  the 
operation  the  local  component  of  the  program  may  interact  with  remote  components 
of  the  program.  Figure  3.1  illustrates  this  view  of  a  distributed  program. 

We  distinguish  between  the  implementation  of  a  distributed  program  and  its 
behavior  as  observed  by  its  clients.  From  a  client’s  point  of  view  the  program  is 
a  service  that  accepts  requests  from  clients  at  different  sites,  executes  each  request 
and  returns  the  result  to  the  client.  Figure  3.2  illustrates  this  view  of  a  distributed 
program  as  a  centralized  service.  We  use  this  client  view  as  the  basis  of  our  formal 
specifications. 

Definition 3.1
A formal event

$e = A_i(x_1, \ldots, x_k) : v$

denotes operation $A$ invoked by client $i$ with parameters $x_1, \ldots, x_k$, and returning
the value $v$. A formal history

$H = (e_1, e_2, \ldots, e_m)$

is a finite, totally ordered sequence of events.

A  formal  history  describes  the  sequence  of  operations  executed  by  the  service  and 
the  values  returned  to  the  clients.  A  formal  specification  determines  what  constitutes 
correct behavior of the service, by defining which formal histories of the service are
legal.  Since  we  do  not  want  to  commit  ourselves  to  any  particular  logical  language  for 
describing  specifications,  we  simply  identify  a  specification  with  the  set  of  histories 
accepted  by  it. 


Figure  3.1:  A  client  interacting  with  a  distributed  program 


Figure  3.2:  Client  view  of  a  distributed  program 
Definition 3.2
$H + e$ or $He$    denotes the history obtained by appending the event $e$ to $H$.

$HH'$              denotes the concatenation of $H$ and $H'$.

$H \le H'$         means that $H$ is a prefix of $H'$.

$H|_i$             denotes the projection of $H$ onto client $i$, that is the subsequence
                   of $H$ containing all operations invoked by client $i$.
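The notation of Definition 3.2 corresponds closely to ordinary sequence operations. The following Python sketch is only an illustration under that reading: the names `Event`, `append`, `concat`, `is_prefix`, and `project` are hypothetical and do not appear in the thesis, and the tuple encoding of events is an assumption made for convenience.

```python
from typing import NamedTuple, Tuple

class Event(NamedTuple):
    """A formal event A_i(x_1, ..., x_k) : v (illustrative encoding)."""
    client: int            # i, the invoking client
    op: str                # A, the operation name
    args: Tuple = ()       # x_1, ..., x_k
    value: object = None   # v, the returned value

def append(h, e):
    """H + e: the history obtained by appending event e to H."""
    return h + (e,)

def concat(h, h2):
    """HH': the concatenation of H and H'."""
    return h + h2

def is_prefix(h, h2):
    """H <= H': H is a prefix of H'."""
    return h2[:len(h)] == h

def project(h, i):
    """H|i: the subsequence of H containing the events invoked by client i."""
    return tuple(e for e in h if e.client == i)

# A small made-up history over two clients of a READ/WRITE service.
H = (
    Event(1, "WRITE", (5,), "OK"),
    Event(2, "READ", (), 5),
    Event(1, "READ", (), 5),
)
```

Histories are plain tuples here, so every prefix of a history is again a history, which matches the prefix-closure requirement introduced in the next definition.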


Definition  3.3 

A formal specification is a quadruple $\mathcal{S} = (n, I, V, S)$, where $n$ is the number
of clients, $I$ is a set of invocations of the form $A_i(x_1, \ldots, x_k)$, $V$ is a set of
return values, and $S$ is a set of histories. $S$ must satisfy the following two
properties:

$S$ is prefix-closed: $\forall H \in S : \forall H' \le H : H' \in S$,

$S$ is complete and deterministic:

$\forall H \in S : \forall \text{ invocation } a \in I : \exists \text{ unique return value } v \in V : H + a:v \in S.$


At  this  point  it  is  useful  to  give  an  example  that  illustrates  our  formalism.  We  will 
use  this  example  throughout  the  rest  of  this  dissertation.  Consider  the  problem  of 
managing  a  shared  resource  in  a  distributed  system.  The  resource  can  be  accessed 
from  any  site,  but  we  want  to  ensure  that  at  any  given  time  only  one  site  actually 
uses  the  resource.  This  problem  can  be  solved  by  introducing  the  concept  of  a  token 
that  is  associated  with  the  resource.  Only  the  site  that  is  currently  holding  the 
token  is  allowed  to  access  the  resource.  If  the  current  token  holder  no  longer  needs 
the  resource  it  may  pass  the  token  to  some  other  site.  We  want  to  design  a  token 
passing service that manages this token. This service would support the following
operations: 

•  QUERY(): BOOLEAN

-  returns TRUE if the caller is the current token holder.

•  PASS(x: CLIENTID): RETURNCODE

-  passes the token from the current token holder to client x.

The PASS operation returns one of three values: OK, ErrorHolder (the
caller is not the current token holder), or ErrorRequest (client x did not
request the token).

•  REQUEST(): RETURNCODE

-  requests the token.¹

The REQUEST operation returns one of three values: OK, ErrorHolder (the
caller is already holding the token), or ErrorRequest (the caller has already
requested the token).

A  complete  formal  specification  is  given  in  Appendix  A.  Here  we  will  only  list  a 
few histories that illustrate this example.

We assume that initially the token is held by client 1. Consider the history $H_1$:

$H_1 = Q_3:F,\; R_3:ok,\; P_1(3):ok,\; Q_3:T$

Q, R, P stand for QUERY, REQUEST, and PASS operations; T and F stand for the
return values TRUE and FALSE. Client 3 invokes a QUERY and finds out that it is
not holding the token. It then decides to request the token. Client 1 (the initial
token holder) passes the token to client 3, and consequently a QUERY by client 3
returns TRUE. We would consider this a legal history, i.e., $H_1 \in S$.

¹ The REQUEST operation is non-blocking. A client that needs the token would
invoke a REQUEST operation and then repeatedly issue a QUERY operation until it
returns TRUE.

$H_2 = Q_3:F,\; R_3:ok,\; P_1(3):ok,\; Q_3:F$

$H_2$ is an example of an illegal history: although the token has been passed to
client 3, the last QUERY returns FALSE. In this example the token passing service
would have returned the wrong value for the QUERY; therefore $H_2 \notin S$.

$H_3 = Q_2:F,\; R_2:ok,\; P_3(2):ok,\; Q_3:F$

This history is also illegal: client 3 is passing the token although it is not holding it.
The token passing service behaved incorrectly by returning OK for this operation.
It should instead have returned the value ErrorHolder, indicating an error:

$H_4 = Q_2:F,\; R_2:ok,\; P_3(2):ErrorHolder,\; Q_3:F$
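The legality of histories such as these can be checked mechanically once the specification is written as a function from a history and an invocation to the unique return value. The Python sketch below is a hedged illustration, not the specification of Appendix A: the function names `execute` and `is_legal`, the tuple encoding of events, the rule that a successful PASS clears the receiver's pending request, and the precedence of ErrorHolder over ErrorRequest are all assumptions. The four example histories follow the reading of the garbled subscripts adopted in this section.

```python
def execute(history, inv, initial_holder=1):
    """Return the unique legal return value of `inv` after `history`.

    history: sequence of ((client, op, arg), value) events
    inv:     (client, op, arg) with op in {"QUERY", "REQUEST", "PASS"}
    """
    holder, requested = initial_holder, set()
    for (client, op, arg), value in history:     # replay to recover the state
        if op == "REQUEST" and value == "OK":
            requested.add(client)
        elif op == "PASS" and value == "OK":
            holder = arg
            requested.discard(arg)               # assumption: request is consumed
    client, op, arg = inv
    if op == "QUERY":
        return client == holder
    if op == "REQUEST":
        if client == holder:
            return "ErrorHolder"
        return "ErrorRequest" if client in requested else "OK"
    if op == "PASS":
        if client != holder:
            return "ErrorHolder"                 # assumption: checked first
        return "OK" if arg in requested else "ErrorRequest"
    raise ValueError(f"unknown operation {op!r}")

def is_legal(history):
    """A history is legal iff every event carries the value the spec dictates."""
    prefix = []
    for inv, value in history:
        if execute(prefix, inv) != value:
            return False
        prefix.append((inv, value))
    return True

Q = lambda i, v: ((i, "QUERY", None), v)
R = lambda i, v: ((i, "REQUEST", None), v)
P = lambda i, x, v: ((i, "PASS", x), v)

H1 = [Q(3, False), R(3, "OK"), P(1, 3, "OK"), Q(3, True)]
H2 = [Q(3, False), R(3, "OK"), P(1, 3, "OK"), Q(3, False)]
H3 = [Q(2, False), R(2, "OK"), P(3, 2, "OK"), Q(3, False)]
H4 = [Q(2, False), R(2, "OK"), P(3, 2, "ErrorHolder"), Q(3, False)]
```

Because `is_legal` checks events one at a time against the history so far, any set of histories it accepts is automatically prefix-closed, and `execute` being a total function gives completeness and determinism.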

In  our  formalism  we  make  a  number  of  implicit  and  explicit  assumptions  about 
the  distributed  service. 

1.  A  client  invokes  only  one  operation  at  a  time  and  waits  for  the  operation  to 
complete  before  invoking  the  next  one. 

2.  We  assume  every  operation  can  be  executed  as  soon  as  it  is  invoked.  In 
particular  there  are  no  operations  that  explicitly  wait  until  another  client  has 
taken  a  certain  action.  In  our  formalism,  operations  with  wait  semantics  must 
be  modeled  by  a  “busy  wait”.  The  token  passing  service  for  example  does  not 
have  an  operation  WaitForToken.  Instead  we  provide  the  REQUEST  and 
QUERY operations. A client waiting for a token would periodically invoke a
QUERY  until  it  returns  TRUE.  In  Appendix  B  we  show  formally  that  any 
operation with wait semantics can be modeled in this way. We chose this
model  because  it  is  simpler  and  entails  no  loss  of  generality. 

3.  We  require  specifications  to  be  prefix-closed.  This  allows  us  to  decide,  at 
any  time  during  the  execution  of  a  system,  whether  the  service  has  behaved 
correctly  so  far.  In  other  words,  the  correctness  of  an  execution  up  to  a  given 
time does not depend on any future events. The prefix closure of $S$ also
makes it unnecessary to consider infinite histories. An infinite, legal history
is represented in $S$ by all its (finite) prefixes. However, because histories
are finite and specifications are prefix-closed, our formalism can only express
safety properties, not liveness properties [SA85].

4.  We  only  consider  deterministic  specifications  in  which  the  value  returned  by 
an  operation  is  determined  completely  by  the  parameters  of  the  operations 
and  by  the  previous  history.  Also,  because  specifications  are  complete,  all 
operations are total. In other words, clients are not restricted to invoking only
“legal”  operations.  Any  specification  can  be  made  complete  by  specifying  that 
an  operation  should  return  a  distinguished  value  ERROR  when  performed  in  a 
state  in  which  it  would  otherwise  not  be  legal  to  execute  the  operation. 

Our  specifications  differ  from  other  formal  specification  methods.  In  particular, 
we  do  not  associate  any  state  variables  with  a  service.  For  example,  consider  a 
service that provides two operations READ and WRITE. Instead of saying that a
WRITE(x=5) changes the value of some internal variable x, we specify the effect of
this operation by saying that the next operation READ(x) should return the value
5. Rather than specifying how an operation changes the internal state of a service,
we specify how the operation affects the result of future operations.
In some sense these two approaches for formal specifications are equivalent. In
our  formalism  the  current  state  of  a  service  is  represented  by  the  history  of  all 
operations  executed  so  far.  A  new  operation  changes  this  state  by  appending  an 
event  to  the  history.  We  chose  the  history-based  approach  because  it  does  not 
assume  any  specific  internal  representation  for  the  state  of  the  service. 

3.2  System  Execution  Model 

The  main  goal  of  this  dissertation  is  to  find  out  how  different  forms  of  broadcasts 
can  be  used  to  construct  a  solution  to  a  problem  that  is  specified  in  the  formalism  we 
introduced  in  the  previous  section.  In  this  section  we  present  a  model  for  studying 
broadcast-based  implementations  of  a  service. 

In  the  most  general  terms,  a  distributed  implementation  of  a  service  runs  like 
this: 

•  A client at processor i invokes an operation a.

•  Processor  i  starts  an  agreement  protocol  among  all  processors  to  decide  on  the 
effect  of  the  operation  and  its  return  value. 

•  When  the  protocol  terminates,  the  result  is  returned  to  the  client. 

We will show in Section 3.5 that in order to obtain an implementation of any
formally specified problem, it is sufficient to have the agreement protocol establish a
global order on all the operations invoked by different clients in the system. An
atomic broadcast (ABCAST) does just that. An implementation based on ABCAST
would run like this:

•  A  client  at  processor  i  invokes  an  operation  a. 
•  Processor i puts the operation (including its parameters) into a message and
broadcasts  the  message  to  all  sites  in  the  system  (including  itself). 

•  Other  processors  that  receive  this  message  update  their  local  state. 

•  When  site  i  receives  its  own  message,  it  also  updates  its  state  and  at  that 
time  computes  the  result  to  be  returned  to  the  client. 

In  Section  3.5  we  will  make  this  more  precise  and  prove  that  such  an  implementation 
indeed  gives  a  correct  solution  for  any  specification.  In  Chapter  4  we  will  then 
explore  conditions  under  which  it  is  possible  to  replace  the  ABCAST  by  a  more 
efficient broadcast protocol that does not require the client to wait for a multi-phase
agreement protocol to finish before it gets back its return value. In the rest
of  this  section  we  define  our  model  for  broadcast-based  implementations  and  give  a 
criterion  for  their  correctness  with  respect  to  a  formal  specification. 

3.2.1  Execution  Histories 

An  execution  of  a  broadcast-based  implementation  outlined  above  can  be  described 
by  a  picture  like  Figure  3.3.  The  horizontal  lines  show  events  happening  at  different 
processors.  To  simplify  the  model  we  assume  that  there  is  only  one  client  per 
processor. There are two different types of execution events:²

1.  Invocation events, which denote the invocation of an operation by the local
client. An invocation causes a message to be broadcast to all sites. These
messages are represented by the arrows in the figure.

2.  Receive events, which denote the receipt of a message that was broadcast from
some other site. The tip of each arrow represents a receive event in the figure.

² Note that execution events are different from formal events as in Definition 3.1.
Definition 3.11 in Section 3.2.2 relates these two types of events.

Consider,  for  example,  the  events  at  processor  2  in  Figure  3.3.  The  first  event  is  an 
operation  “a”  invoked  by  the  client  at  that  site.  The  invocation  causes  the  processor 
to  send  out  a  broadcast,  which  is  represented  by  the  four  arrows  originating  from 
the  circle  at  “a”.  The  end  of  an  arrow  represents  the  receive  event  that  arises 
when  the  broadcast  message  is  delivered  at  another  (or  the  same)  site.  A  receive 
event  is  labeled  by  a  pair  of  integers;  the  first  one  designates  the  processor  that 
sent  the  broadcast,  and  the  second  one  counts  broadcasts  sent  from  that  processor. 
The second event at $p_2$ is the receive event (2,1), denoting the delivery of the
first broadcast from itself. There are three more receive events at $p_2$: (3,1), (1,1)
and (3,2). They denote the delivery of the first broadcast from $p_3$, the first one
from $p_1$, and the second broadcast from $p_3$. Below we describe such a graphical
representation of an execution in formal terms:

Definition  3.4 

An execution sequence $E = (E_1, \ldots, E_n)$ is a collection of totally ordered sets
of invocation and receive events,

$E \in [(I \cup \mathbb{N}^2)^*]^n$,

satisfying the condition:

$\forall\, inv_E(i,j) : \forall k : \exists \text{ unique receive event } (i,j) \in E_k$,
where $inv_E(i,j)$ denotes the $j$'th invocation event in $E_i$.


We  now  introduce  some  terminology  and  notation: 
Definition 3.5
$E[i,j]$             the $j$'th event in $E_i$.

$<_i$                the order of events in $E_i$, i.e., $E[i,j] <_i E[i,j']$ iff $j < j'$.

$inv_E(i,j)$         the $j$'th invocation event in $E_i$.

$rcv_E((i,j),k)$     the receive event $(i,j)$ in $E_k$.

$inum_E(i,j)$        the sequence number of $inv_E(i,j)$ in $E_i$,
                     i.e., if $E[i,l] = inv_E(i,j)$ then $inum_E(i,j) = l$.

$rnum_E((i,j),k)$    the sequence number of $rcv_E((i,j),k)$ in $E_k$,
                     i.e., if $E[k,l] = rcv_E((i,j),k)$ then $rnum_E((i,j),k) = l$.

$E - a$              (where $a = inv_E(i,j)$ is an invocation event) the execution
                     history that is identical to $E$ except that $a$ and all its
                     corresponding receive events ($rcv_E((i,j),k)$, for all $k$) are deleted.

Definition 3.6

Given an execution sequence $E$ we define the relation $\xrightarrow{D}$ on the events in $E$:

$E[i,j] \xrightarrow{D} E[i,j+1]$ for all $i, j$.

$inv_E(i,j) \xrightarrow{D} rcv_E((i,j),k)$ for all $i, j, k$.

If $a \xrightarrow{D} b$ we say that $a$ directly precedes $b$.

An execution like the one in Figure 3.3 can be viewed as a directed graph that has
invocation and receive events as nodes and two types of edges: the horizontal lines
that connect events happening at the same processor and the arrows that represent
broadcast messages. The $\xrightarrow{D}$ relation defines the edges in the execution graph.

For an execution sequence to make sense we need to add a few more restrictions
to Definition 3.4. For example, we need a condition that prevents messages from
flowing backwards in time.


Definition  3.7 

An execution history $E = (E_1, \ldots, E_n)$ is an execution sequence satisfying the
following additional conditions:

•  Sequential invocation: Clients invoke operations sequentially, i.e., a
client waits for the present invocation to complete before invoking a new
one:

$\forall i,j : rcv_E((i,j),i) <_i inv_E(i,j+1)$

•  Monotonicity of time: The $\xrightarrow{D}$ relation is acyclic (messages do not
flow backwards in time):

$\neg \exists\, e_1, \ldots, e_m \in E : e_1 \xrightarrow{D} e_2 \xrightarrow{D} \cdots \xrightarrow{D} e_m \xrightarrow{D} e_1$


In  addition  we  may  specify  the  ordering  properties  of  the  broadcast  protocol  used 
by  giving  a  message  ordering  axiom.  For  example,  if  we  are  interested  in  systems  in 
which  an  atomic  broadcast  is  used,  we  would  specify  an  ABCAST-axiom  that  ensures 
that  all  messages  are  received  in  the  same  order  everywhere: 


Definition  3.8 

ABCAST ordering axiom:

$\forall i,j,i',j' : \forall k,l :$

$rcv_E((i,j),k) <_k rcv_E((i',j'),k) \Leftrightarrow rcv_E((i,j),l) <_l rcv_E((i',j'),l)$.

An execution history that satisfies this ABCAST axiom in addition to the requirements
of Definition 3.7 would be called an “ABCAST execution history”.
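The ABCAST ordering axiom can be tested directly on a concrete execution sequence. The following Python sketch assumes a hypothetical encoding that is not part of the thesis: an execution is a list of per-processor event lists, an invocation event is `("inv", a)`, and a receive event is `("rcv", i, j)` with 1-based processor and broadcast numbers. The well-formedness check is slightly stricter than Definition 3.4 in that it also rejects receive events without a matching invocation.

```python
from itertools import combinations

def is_execution_sequence(E):
    """Every broadcast (i, j) is received exactly once at every processor."""
    invoked = {
        (i + 1, j)
        for i, seq in enumerate(E)
        for j in range(1, sum(ev[0] == "inv" for ev in seq) + 1)
    }
    for seq in E:
        received = [ev[1:] for ev in seq if ev[0] == "rcv"]
        if len(received) != len(set(received)) or set(received) != invoked:
            return False
    return True

def receive_positions(seq):
    """Map each received message label (i, j) to its position in the sequence."""
    return {ev[1:]: pos for pos, ev in enumerate(seq) if ev[0] == "rcv"}

def satisfies_abcast(E):
    """Any two messages are received in the same relative order everywhere."""
    pos = [receive_positions(seq) for seq in E]
    messages = set().union(*pos)
    for m1, m2 in combinations(sorted(messages), 2):
        orders = {p[m1] < p[m2] for p in pos if m1 in p and m2 in p}
        if len(orders) > 1:          # two processors disagree on the order
            return False
    return True

# Two processors, one broadcast each.
E_good = [
    [("inv", "a"), ("rcv", 1, 1), ("rcv", 2, 1)],
    [("inv", "b"), ("rcv", 1, 1), ("rcv", 2, 1)],
]
E_bad = [
    [("inv", "a"), ("rcv", 1, 1), ("rcv", 2, 1)],
    [("inv", "b"), ("rcv", 2, 1), ("rcv", 1, 1)],   # opposite receive order
]
```

Both examples are well-formed execution sequences, but only `E_good` satisfies the ordering axiom; in `E_bad` each processor delivers its own broadcast first.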


3.2.2  Implementations 

The  previous  section  described  a  system  execution  only  in  terms  of  what  operations 
clients  invoke  and  when  messages  are  sent  and  received.  It  does  not  specify  what 
the  contents  of  these  messages  are,  how  the  recipient  processes  such  a  message,  or 
what  values  are  returned  to  the  client  as  the  result  of  an  invocation.  In  other  words, 
we  need  to  specify  what  the  program  running  at  each  site  does. 

We  do  this  by  modeling  each  processor  as  a  state  machine  that  reacts  to  input 
events  (invocation  events  or  receive  events)  by  changing  its  state  and  generating  an 
output  event  (message  to  be  broadcast  or  value  returned  to  the  client).  This  state 
machine has two types of transition functions ($\phi$ and $\psi$), corresponding to the two
types  of  input  events. 

Definition  3.9 

An implementation is an 8-tuple $(n, I, V, M, Q, q_0, \Phi, \Psi)$, where

$n$    the number of processors in the system

$I$    the set of operations that can be invoked

$V$    the set of return values

$M$    the set of message values

$Q$    the set of states in which a processor can be

$q_0$    the initial state of all processors

$\Phi = (\phi_1, \ldots, \phi_n)$    invocation transition functions

$\phi_i : Q \times I \to Q \times M$

$\Psi = (\psi_1, \ldots, \psi_n)$    message receive transition functions

$\psi_i : Q \times M \to Q \times V$
The meaning of the transition functions $\phi_i$ and $\psi_i$ is as follows: When an operation
$a \in I$ is invoked by client $i$, processor $i$ changes its state from $q$ to $q'$ and broadcasts
the message $m$, where $(q', m) = \phi_i(q, a)$. When such a message is received at site
$j$, processor $j$ changes its state from $q$ to $q'$, where $(q', v) = \psi_j(q, m)$. The return
value for this operation is $v$; at the site where the operation was invoked this value
is passed to the client; at the other sites it is ignored. We will use superscripts
$s, m, v$ to refer to the state, message, and return value of $\phi$ and $\psi$, respectively, as
defined below:

if $\phi_i(q, a) = (q', m)$ then $\phi_i^s(q, a) = q'$, $\phi_i^m(q, a) = m$;
if $\psi_i(q, m) = (q', v)$ then $\psi_i^s(q, m) = q'$, $\psi_i^v(q, m) = v$.

Given such a formal implementation, we can take an execution history and determine
what messages are sent and what values are returned to the client. We start
by  giving  a  definition  for  computing  the  state  of  a  processor  after  a  particular  event 
in  an  execution  history: 

Definition  3.10 

$stat_E[i,j] = \begin{cases}
q_0 & \text{if } j = 0 \\
\phi_i^s(stat_E[i,j-1],\, a) & \text{if } E[i,j] = a \text{ is an invocation event} \\
\psi_i^s(stat_E[i,j-1],\, m) & \text{if } E[i,j] = (k,l) \text{ is a receive event, where } m = \phi_k^m(stat_E[k,\, inum_E(k,l)-1],\, inv_E(k,l))
\end{cases}$

Then $stat_E[i,j]$ defines the state of processor $i$ after the $j$'th event at that
site. Note that the monotonicity requirement for execution histories (Definition 3.7)
prevents this definition from being circular. It is now straightforward to give definitions
that compute the messages being sent, the values returned to clients, the
formal events (invocation plus return value as in Definition 3.1) observed by clients,
as well as the sequence of formal events that any particular client observes:

Definition  3.11 

$msg_E(i,j) = \phi_i^m(stat_E[i,\, inum_E(i,j) - 1],\, inv_E(i,j))$

$val_E(i,j) = \psi_i^v(stat_E[i,\, rnum_E((i,j),i) - 1],\, msg_E(i,j))$

$event_E(i,j) = a:v$, where $a = inv_E(i,j)$, and $v = val_E(i,j)$.

$H[E,i] = (event_E(i,1),\, event_E(i,2),\, \ldots,\, event_E(i,m))$,
where $m$ is the number of invocation events in $E_i$.
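Definitions 3.10 and 3.11 amount to running the state machines along the execution graph. The Python sketch below is one possible reading, not the thesis's own algorithm: the function name `client_histories`, the event encoding (`("inv", a)` and `("rcv", k, l)` with 1-based labels), and the replicated-counter example are all assumptions. A receive event is processed only after the sender's message has been computed, which is always possible when the execution satisfies the monotonicity condition.

```python
def client_histories(E, phi, psi, q0):
    """Compute H[E, i] for every processor i.

    E:   list of per-processor event lists
    phi: phi[i](q, a) -> (q', m), invocation transition of processor i
    psi: psi[i](q, m) -> (q', v), receive transition of processor i
    """
    n = len(E)
    state = [q0] * n
    pos = [0] * n                 # index of the next unprocessed event
    inv_count = [0] * n
    msg, ops = {}, {}             # (k, l) -> message / invoked operation
    hist = [[] for _ in range(n)]
    progress = True
    while progress:
        progress = False
        for i in range(n):
            while pos[i] < len(E[i]):
                ev = E[i][pos[i]]
                if ev[0] == "inv":
                    inv_count[i] += 1
                    key = (i + 1, inv_count[i])
                    state[i], msg[key] = phi[i](state[i], ev[1])
                    ops[key] = ev[1]
                else:
                    key = (ev[1], ev[2])
                    if key not in msg:
                        break     # sender has not invoked yet; retry later
                    state[i], v = psi[i](state[i], msg[key])
                    if key[0] == i + 1:          # own broadcast: value goes
                        hist[i].append((ops[key], v))   # to the local client
                pos[i] += 1
                progress = True
    if any(pos[i] < len(E[i]) for i in range(n)):
        raise ValueError("cyclic execution or receive without invocation")
    return hist

# Hypothetical replicated counter: "inc" returns the new count.
phi = [lambda q, a: (q, a)] * 2            # state unchanged, broadcast a
psi = [lambda q, m: (q + 1, q + 1)] * 2    # apply the increment on receipt

E_abcast = [
    [("inv", "inc"), ("rcv", 1, 1), ("rcv", 2, 1)],
    [("inv", "inc"), ("rcv", 1, 1), ("rcv", 2, 1)],
]
E_unordered = [
    [("inv", "inc"), ("rcv", 1, 1), ("rcv", 2, 1)],
    [("inv", "inc"), ("rcv", 2, 1), ("rcv", 1, 1)],
]
```

With identical receive orders the two clients observe the counts 1 and 2. With opposite receive orders both observe 1, a pair of client histories that no single sequential counter could have produced.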


3.3  Implementation  Correctness 

In  Section  3.1  we  defined  formal  problem  specifications  in  terms  of  totally  ordered 
histories  which  record  a  sequence  of  events  executed  by  a  centralized  service.  In 
our  model  of  distributed  implementations,  however,  there  is  no  centralized  service. 
Instead of one global history, we have a set of histories $H[E,i]$ containing the subset
of events observed by individual clients. We consider such a distributed implementation
correct if, to the clients, its behavior is indistinguishable from the behavior
of  a  centralized  service  which  performed  the  same  set  of  operations.  In  particular, 
the  implementation  must  satisfy  the  following  condition: 

For every execution history $E$, it must be possible to merge all $H[E,i]$
into one legal, global history $H \in S$.

This ensures that clients cannot distinguish an execution of the distributed implementation
from a centralized one, because they all see part of a history that
would have been generated by a centralized server. This correctness condition is
very similar to the notion of serializability familiar from database theory [BG81,
Pap79].

This  condition  alone  is  not  enough  to  ensure  that  the  distributed  implementation 
behaves  as  one  would  expect.  We  need  to  add  a  condition  that  says  something 
about  the  relative  order  in  which  events  invoked  at  different  processors  appear  in 
the  global  history  H.  Consider  the  token  passing  example  from  Section  3.1.  One 
could  implement  the  token  service  in  the  following  trivial  way  (recall  that  client  1 
is  the  initial  token  holder): 

•  QUERY always returns FALSE if it is invoked by any client other than client 1.

•  If invoked by client 1, QUERY returns TRUE until client 1 passes the token.
After the event $P_1(j):ok$, a QUERY by client 1 always returns FALSE.

This implementation effectively “loses” the token after the first pass operation, because
subsequent queries by any client return FALSE. However, notice that the implementation
satisfies the correctness condition stated above. An execution history
for this service might generate the following collection of formal histories observed
by clients:

$H[E,1] = Q_1:T,\; P_1(3):ok,\; Q_1:F,\; Q_1:F$
$H[E,2] = Q_2:F,\; Q_2:F,\; Q_2:F,\; \ldots,\; Q_2:F$
$H[E,3] = Q_3:F,\; Q_3:F,\; Q_3:F,\; \ldots,\; Q_3:F$

These three histories can easily be merged into a legal history:

$H = Q_1:T,\; Q_2:F,\; Q_3:F,\; \ldots,\; Q_1:T,\; Q_2:F,\; Q_3:F,\; P_1(3):ok,\; Q_1:F,\; Q_1:F$

In  other  words,  by  putting  the  PASS  event  (and  everything  following  it  in  H[E,  1]) 
at  the  very  end  of  H,  we  always  get  a  legal  merged  history. 
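The merge argument can be reproduced with a small search over interleavings. The Python sketch below is illustrative only: `merge_legal` is a hypothetical helper that tries every interleaving of the per-client histories and keeps one in which each event is legal, and `execute` is a deliberately simplified token specification that ignores REQUEST bookkeeping, an assumption made to match the abbreviated histories shown here. The search is exponential in general and is meant only for tiny examples.

```python
def execute(history, inv, initial_holder=1):
    """Simplified token spec: QUERY and PASS only, no REQUEST bookkeeping."""
    holder = initial_holder
    for (client, op, arg), value in history:
        if op == "PASS" and value == "OK":
            holder = arg
    client, op, arg = inv
    if op == "QUERY":
        return client == holder
    return "OK" if client == holder else "ErrorHolder"    # op == "PASS"

def merge_legal(histories, execute):
    """Return one legal interleaving of the given client histories, or None."""
    n = len(histories)

    def search(idx, h):
        if all(idx[c] == len(histories[c]) for c in range(n)):
            return h
        for c in range(n):
            if idx[c] < len(histories[c]):
                inv, value = histories[c][idx[c]]
                if execute(h, inv) == value:      # event is legal here
                    nxt = idx[:c] + (idx[c] + 1,) + idx[c + 1:]
                    found = search(nxt, h + ((inv, value),))
                    if found is not None:
                        return found
        return None

    return search((0,) * n, ())

Q = lambda i, v: ((i, "QUERY", None), v)
h1 = [Q(1, True), ((1, "PASS", 3), "OK"), Q(1, False), Q(1, False)]
h2 = [Q(2, False)] * 3
h3 = [Q(3, False)] * 3

merged = merge_legal([h1, h2, h3], execute)
pass_at = merged.index(((1, "PASS", 3), "OK"))
```

In the merged history found by the search, every query by client 3 precedes the PASS event, which is exactly the deferral that the correctness condition alone fails to rule out.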
To solve this problem we need to add a condition that prevents events from
being indefinitely deferred in the merged history H. An event observed by one
client should eventually get a “stable” place in H. We have to define what we mean
by “eventually”, since our execution model does not contain real time. Recall that
we  are  assuming  an  asynchronous  distributed  system  in  which  messages  may  be 
delayed  arbitrarily.  It  could  be  that  the  broadcast  protocol  initiated  for  the  pass 
operation  terminates  quickly  at  site  1,  whereas  due  to  message  delays  it  finishes 
much  later  at  sites  2  and  3.  In  this  case  we  would  consider  the  execution  outlined 
above  an  acceptable  behavior  of  a  distributed  implementation.  Therefore  we  add 
the  following  condition: 

Once a broadcast message about an event a has been received at site i,
the event becomes “stable” with respect to other events at site i. That
is, when we construct a legal, global history H by merging individual
processor histories, any event b that was invoked at site i after the
message about a was received at i, must be ordered after a in H.

In  other  words,  we  allow  an  event  invoked  at  another  site  to  be  ignored  only  as  long 
as  the  message  about  it  is  still  in  transit.  In  the  token  passing  example  above,  this 
condition says that as soon as the message about the operation $PASS_1(3)$ is received
at  site  3,  the  next  QUERY  operation  should  return  TRUE. 

The next definition summarizes our two correctness conditions for distributed
implementations.
Definition 3.12

$Y$ is a correct XBCAST-implementation of specification $\mathcal{S} = (n, I, V, S)$ iff:

$\forall \text{ XBCAST execution history } E : \exists H \in S :$

Correctness: $\forall i : H|_i = H[E,i]$

Liveness: $\forall i,j,k,l :$

$rcv_E((i,j),k) <_k inv_E(k,l) \Rightarrow event_E(i,j) <_H event_E(k,l)$

Here  “XBCAST”  stands  for  the  type  of  broadcast  used  in  the  implementation.  As 
discussed  above,  the  second  condition  (liveness)  makes  sure  that  as  long  as  the 
broadcast protocol guarantees that every message will eventually be delivered
everywhere, every operation invoked by a client will eventually be reflected in operations
at other sites. In other words, liveness of the broadcast protocol implies
liveness  of  the  implementation. 

3.4  Externally  Observed  Histories 

Our  model  of  execution  histories  does  not  contain  any  notion  of  real  time.  This  raises 
the  question:  How  does  an  execution  history  relate  to  what  an  external  observer 
sees  during  the  execution  of  an  implementation?  In  an  asynchronous  system  it 
does  not  make  much  sense  to  talk  about  time  in  absolute  terms  (e.g.,  milliseconds). 
However,  we  can  consider  the  relative  order  —  in  real  time  —  of  events  occurring 
during  the  execution  of  an  implementation.  Imagine  an  external  observer  who  is 
able  to  monitor  all  nodes  in  a  distributed  system  simultaneously.  Such  an  observer 
would  be  able  to  determine  a  total  order  on  all  invocation  and  receive  events  at  all 
sites. We call this sequence of execution events an external history, $E_{ext}$. We can
make the following statement about the relationship between the formal execution
history $E$ and the corresponding external history $E_{ext}$:

1.  The formal execution history $E$ already determines a total order on all events
that happen at the same processor. Therefore, for all $i$, the events in $E_i$
appear in exactly the same order in $E_{ext}$.

2.  The  relative  order  of  events  at  different  processors  is  not  determined  by  E, 
except  that  a  receive  event  can  never  precede  its  corresponding  invocation 
event,  because  messages  do  not  flow  backwards  in  time. 

We  can  summarize  this  in  the  following  statement: 

The external history $E_{ext}$ can be any total order on the events in $E$
that is consistent with the directly-precedes relation $\xrightarrow{D}$.

Given $E_{ext}$ we can extract an external formal history $H_{ext}$ recording all the formal
events during the execution of an implementation in the order they are seen by the
external observer. Notice that our correctness definition does not imply that $H_{ext}$
is always legal. However, it ensures that there exists a legal history that is similar
to $H_{ext}$, as defined below.

Definition  3.13 

$H$ is similar to $H'$ ($H \approx H'$) iff $\forall i : H|_i = H'|_i$

If clients communicate only through requests to the distributed service, then similar
histories are indistinguishable to all clients. For a correct implementation it will
always be the case that

$\forall \text{ execution history } E : \exists H \in S : H_{ext} \approx H$

For example, an event $a$ may logically be ordered before $b$ in $H$ ($a <_H b$), but
physically $a$ could be observed after $b$, if $a$ and $b$ are events at different processors.
Hence the above statement just rephrases the requirement that a correct, distributed
implementation be indistinguishable from a centralized implementation in which the
externally observed history $H_{ext}$ is always legal.
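Similarity is simply equality of all per-client projections, as the following Python sketch shows. The function names and the `(client, operation, value)` encoding are hypothetical, and the two histories are a made-up instance of the token passing example: externally the PASS by client 1 is observed before the QUERY by client 3, yet the query still returns FALSE because the broadcast has not reached site 3.

```python
def project(history, i):
    """H|i: the events of `history` invoked by client i, in order."""
    return [e for e in history if e[0] == i]

def similar(h1, h2):
    """H is similar to H' iff all client projections coincide."""
    clients = {e[0] for e in h1} | {e[0] for e in h2}
    return all(project(h1, i) == project(h2, i) for i in clients)

# Externally observed order: not legal, since the query follows the pass.
H_ext = [(3, "REQUEST", "OK"), (1, "PASS(3)", "OK"), (3, "QUERY", False)]

# Logical order: legal, the query is placed before the pass.
H = [(3, "REQUEST", "OK"), (3, "QUERY", False), (1, "PASS(3)", "OK")]
```

Neither client can tell the two orders apart, because each sees only its own events in its own order.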

3.5  ABCAST  Implementation

In Section 3.2 we claimed that the strong ordering properties of the atomic broadcast
provide enough synchronization between processors to solve any problem that
can  be  specified  in  our  formalism.  In  this  section  we  will  prove  this  claim.  The 
purpose  of  this  exercise  is  twofold.  By  showing  that  every  problem  has  an  ABCAST 
implementation,  we  demonstrate  that  our  model  of  broadcast-based  implementation 
is  not  too  restrictive.  Furthermore,  in  the  next  chapter  we  will  use  methods  similar 
to  the  ones  in  this  section  to  construct  implementations  based  on  more  efficient 
protocols. 

Given a formal specification $\mathcal{S} = (n, I, V, S)$ we will construct an implementation
that satisfies our Definition 3.12 of correctness for all ABCAST execution histories.
Figure 3.4 describes this implementation informally in Pascal-like pseudocode. The
implementation is essentially a variation of the state machine approach to replication
as described in [Sch86]. The current system state is represented by the sequence of
all operations executed so far (variable “H” in Figure 3.4). This state, as well as the
execution  of  client  requests,  is  fully  replicated.  An  operation  invoked  by  a  client  is 
broadcast  to  every  site  (including  the  one  at  which  it  was  invoked)  and  is  executed 
everywhere  when  it  is  received.  Executing  an  operation  in  a  state  H  simply  means 
adding  a  new  event  to  H  after  choosing  a  suitable  return  value  v,  such  that  the 
new  history  is  still  legal.  The  requirement  that  specifications  be  deterministic  and 
complete ensures that there is always exactly one choice for such a value. This,
and the fact that ABCAST delivers all broadcasts in the same order at every site,
implies that all processors will agree on the same legal history H of events that
have occurred in the system. This history will satisfy the correctness condition in
Definition 3.12. In order to prove this, we translate this implementation into our
formal execution model.

Processor i runs the following program:

    H := empty;
    loop
        wait for an invocation by the local client or the receipt of a broadcast;
        if client invoked operation a then
            abcast “a” to all processors;
        else if broadcast “a” was received from j then
            pick a value v, such that H + a:v ∈ S;
            H := H + a:v;
            if j = i then return value v to the client end if
        end if
    end loop

Figure 3.4: ABCAST implementation

Because specifications are deterministic and complete, we can define an execution
function $X_S : S \times I \to V$ such that

$\forall H \in S,\ a \in I:\quad v = X_S(H, a) \Rightarrow H + a{:}v \in S,$

or in words: $X_S$ computes the correct return value of operation a invoked in state
H. Given a specification $\mathcal{S} = (n, I, V, S)$ we define the implementation $Y_S$:

$Y_S = (n, I, V, I, (I \times V)^*, \emptyset, \Phi, \Psi)$, i.e., $M = I$, $Q = (I \times V)^*$, $q_0 = \emptyset$,

where the transition functions $\phi_i$ and $\psi_i$ are defined as follows: When operation a is
invoked at processor i in state H, it does not change its state but broadcasts “a”:

$\phi_i(H, a) = (H, a).$

When processor i receives a message containing operation a it executes the operation
by adding the event a:v to its history H; the value v is returned to the client:

$\psi_i(H, a) = (H + a{:}v,\ v)$, where $v = X_S(H, a)$.
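
The replicated state machine described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the counter specification, the operation strings ("inc k", "read"), the convention that an increment returns 0, and the names counter_exec, psi and run_replicas are all hypothetical and do not appear in the text. The total order that ABCAST would provide is modeled simply as one shared list of operations.

```python
from typing import List, Tuple

Event = Tuple[str, int]      # a formal event "a:v" as (operation, return value)
History = List[Event]

def counter_exec(history: History, op: str) -> int:
    """Hypothetical execution function X_S for a counter specification.

    Assumption: "inc k" always returns 0 and "read" returns the sum of
    all increments already in the history.
    """
    total = sum(int(a.split()[1]) for a, _ in history if a.startswith("inc"))
    return total if op == "read" else 0

def psi(history: History, op: str) -> Tuple[History, int]:
    """Receive transition: append a:v with v chosen by the execution function."""
    v = counter_exec(history, op)
    return history + [(op, v)], v

def run_replicas(n: int, delivery_order: List[str]) -> List[History]:
    """Deliver the same sequence of operations to n replicas (the ABCAST order)."""
    states: List[History] = [[] for _ in range(n)]
    for op in delivery_order:
        for i in range(n):
            states[i], _ = psi(states[i], op)
    return states

states = run_replicas(3, ["inc 2", "read", "inc 5", "read"])
```

Because every replica starts from the empty history and applies the same deterministic transition to the same sequence, all entries of states are equal, which is the content of Lemma 3.1 below.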

Lemma  3.1 

For every ABCAST execution history E of $Y_S$: the final state of all processors is
identical.

Proof: A processor state only changes when a message is received (the state
component of $\phi_i$ is the identity function). Because of the ABCAST ordering axiom
(Definition 3.8), all $E_i$ contain the same sequence of receive events. Since all
processors start in the same state $q_0 = \emptyset$ and the transition functions $\psi_i$ are
identical, all processors will end up in the same final state. □

Theorem  3.1 

$Y_S$ is a correct ABCAST implementation of specification $\mathcal{S}$.

Proof: We show that for every execution E, the history $H_f$ given by the final
state of processors in E is legal ($H_f \in S$) and satisfies the correctness and liveness
conditions of Definition 3.12. We do this by induction on the number of events in
$H_f$.

The base case, $H_f = \emptyset$, is trivially satisfied because an empty history is always
legal. This follows from the fact that specifications are prefix-closed.

For the induction step, consider an execution history E such that $H_f$ is non-empty.
Let $\mathrm{rcv}_E((i,j), 1)$ be the last receive event in $E_1$. Because of the ABCAST ordering,
$\mathrm{rcv}_E((i,j), k)$ is the last event in $E_k$ for all k. Let $E' = E - \mathrm{inv}_E(i,j)$ (i.e., E with
$\mathrm{inv}_E(i,j)$ and $\mathrm{rcv}_E((i,j), k)$, for all k, deleted). Let $H'_f$ be the history given by the
final state of processors in $E'$.

We first show that $H_f$ is legal. By induction hypothesis $H'_f \in S$. Furthermore,
$H_f = H'_f + a{:}v$, where

$v = \mathrm{val}_E(i,j) = \psi_i^v(H'_f, a) = X_S(H'_f, a).$

Therefore $H_f = H'_f + a{:}v \in S$ follows from the definition of $X_S$. We complete
the proof by showing that $H_f$ satisfies the correctness and liveness conditions of
Definition 3.12.
Correctness: We have to show that $H_f|_i = H[E, i]$ for all i. By induction hypothesis
$H'_f|_i = H[E', i]$. Therefore

$H_f|_i = (H'_f + a{:}v)|_i = H'_f|_i + a{:}v = H[E', i] + a{:}v = H[E, i].$

Liveness: Let $\mathrm{rcv}_E((l,m), k) <_k \mathrm{inv}_E(l',m')$. We have to show that
$\mathrm{event}_E(l,m) < \mathrm{event}_E(l',m')$ in $H_f$.
Case 1, $(l',m') = (i,j)$: In this case $\mathrm{event}_E(l',m') = \mathrm{event}_E(i,j)$ is the last event
in $H_f$, and therefore $\mathrm{event}_E(l,m) < \mathrm{event}_E(l',m')$ in $H_f$.
Case 2, $(l',m') \ne (i,j)$: In this case the two events $\mathrm{event}_E(l,m)$ and $\mathrm{event}_E(l',m')$
are both in $H'_f$, and by induction hypothesis $\mathrm{event}_E(l,m) < \mathrm{event}_E(l',m')$ in $H'_f$,
hence also in $H_f$. □

3.6  Summary 

We  presented  two  different  models  for  a  distributed  program:  one  for  the  formal 
specification  of  the  program  and  one  for  its  implementation. 

1.  We  modeled  the  behavior  of  a  distributed  program  as  a  service  that  executes 
requests  on  behalf  of  clients.  An  execution  of  such  a  service  is  described 
as  a  sequence  of  events,  in  which  each  event  denotes  the  execution  of  one 
client  request.  We  call  such  an  event  sequence  a  formal  history.  A  formal 
specification for such a service is a set S that lists all possible legal histories.

2. Our implementation model describes a system as a collection of state machines.
Each processor reacts to input events (invocation events or receive
events)  by  changing  its  state  and  generating  an  output  event  (message  to  be 
broadcast  or  value  returned  to  the  client). 


We then defined the correctness of an implementation with respect to a formal
specification in such a way that clients cannot distinguish the behavior of the distributed
implementation from that of a central server.

To illustrate our formalism and to show that our implementation model is not
too restrictive, we demonstrated that any formal specification has an ABCAST
implementation.


Chapter  4 

Asynchronous  Implementations 


In the previous chapter we saw that every formal specification has an ABCAST
implementation. In this chapter we address the main questions of this dissertation:
Can we construct more efficient implementations by using broadcast protocols that
provide a weaker form of ordering? For which kinds of problems will this be
successful?

We  start  by  considering  implementations  based  on  a  causal  broadcast  (CBCAST). 
We  give  a  necessary  and  sufficient  condition  for  a  specification  to  be  implementable 
with  this  type  of  broadcast.  If  such  an  implementation  exists  it  can  be  expressed  in 
a  standard  form.  Finally,  we  show  that  a  CBCAST  implementation  can  be  translated 
into  an  implementation  based  on  FBCAST  or  even  unordered  broadcasts. 

The implementations we construct this way can be characterized as follows:
When a client invokes an operation, the return value can always be computed
immediately from local information. This way the client need not wait for messages
to arrive at other sites or for replies to make it back; information is propagated
asynchronously to other sites. Therefore we call this type of implementation
asynchronous.


4.1  Causality  and  Timestamps 

In  Chapter  2  we  introduced  the  idea  of  Potential  Causality  [Lam78].  In  our  execution 
model  we  can  define  this  relation  as  follows. 

Definition  4.1 

The information flow relation “$\to$” on the events in an execution history E is
the transitive, reflexive closure of the relation on events that Definition 3.7
requires to be acyclic.

Two events a, b that are not related under “$\to$” are called concurrent ($a \parallel b$).

If we interpret an execution history E as a directed graph (the nodes are the
invocation and receive events in E; the edges are given by that underlying relation),
then $a \to b$ if and only if there is a path from a to b in this graph. Because the
underlying relation is acyclic (Definition 3.7), the information flow relation “$\to$”
defines a partial order on the events in an execution history.1 The intuitive meaning
of this relation is the following. An event a can affect some other event b only if it
precedes b in this partial order. In particular, the state of a processor after an event
b depends only on events that precede b under “$\to$”. This fact is expressed in the
next lemma.

1 However, note that we define “$\to$” to be reflexive ($\forall e \in E:\ e \to e$), contrary
to the usual definition of a partial order. This notational convenience makes the
later definitions simpler.
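
As a concrete reading of the definition, the following sketch computes a reflexive, transitive closure and tests two events for concurrency. It is a minimal illustration, not the construction used in this thesis: the function names and the representation of the underlying relation as an explicit list of edges are assumptions made for the example.

```python
def flow_closure(events, edges):
    """Return the reflexive, transitive closure of `edges` as a set of pairs.

    events: list of hashable event names
    edges:  list of (a, b) pairs meaning "a directly precedes b" (assumed input)
    """
    reach = {e: {e} for e in events}          # reflexive: every event reaches itself
    changed = True
    while changed:                            # iterate until no new pair is found
        changed = False
        for a, b in edges:
            for src in events:
                if a in reach[src] and b not in reach[src]:
                    reach[src].add(b)
                    changed = True
    return {(a, b) for a in events for b in reach[a]}

def concurrent(a, b, closure):
    """Two events are concurrent if neither precedes the other."""
    return (a, b) not in closure and (b, a) not in closure

closure = flow_closure(["p", "q", "r", "s"], [("p", "q"), ("q", "r")])
```

Here p precedes r by transitivity, while s is concurrent with every other event.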
Definition 4.2

$E'$ is a prefix of $E$ ($E' \le E$) iff

(i) $E' \subseteq E$

(ii) $\forall i:\ \forall a, b \in E'_i:\ a <_i b \text{ in } E' \Leftrightarrow a <_i b \text{ in } E$

(iii) $\forall a, b \in E:\ b \in E' \wedge a \to b \Rightarrow a \in E'$

For an event $a \in E$, we define $E[a]$ (the prefix at a) as follows:

(i) $E[a] \le E$

(ii) $\forall$ invocation event $a' \in E$: $a' \in E[a] \Leftrightarrow a' \to a$

Lemma  4.1 

Let $E$ and $E'$ be two execution histories and $a = E[i,j] = E'[i,j]$ be an event
occurring in both histories ($a \in E \cap E'$).

If $E'[a] = E[a]$ then

(i) $\mathrm{stat}_{E'}[i,j] = \mathrm{stat}_E[i,j]$

(ii) $\mathrm{msg}_{E'}(i,l) = \mathrm{msg}_E(i,l)$ if $a = \mathrm{inv}_E(i,l)$ is an invocation event
$\mathrm{val}_{E'}(i,l) = \mathrm{val}_E(i,l)$ if $a = \mathrm{rcv}_E((i,l), i)$ is a local receive event

In other words, if we take an execution history $E$ and modify it into a history $E'$ in
such a way that events preceding a under “$\to$” are unchanged (i.e., $E[a] = E'[a]$)
then the state of processor i after event a as well as the message sent or the value
returned to the client will not be affected by these modifications. Hence the lemma
tells us that $E[a]$ contains exactly those events in $E$ that have an effect on the
outcome of a.

Proof: By induction on the number of events in $E[a]$. In the base case $E[a]$
contains only a single event, namely a. Then a must be the first event in $E_i$, i.e.,
$j = 1$; otherwise the event preceding a at i would also be in $E[a]$. Furthermore,
a cannot be a receive event; otherwise the corresponding invocation event would
precede a under “$\to$” and would be in $E[a]$. Therefore by Definition 3.10

$\mathrm{stat}_E[i,j] = \mathrm{stat}_E[i,1] = \phi_i^s(\mathrm{stat}_E[i,0], a) = \phi_i^s(q_0, a).$

Because $\mathrm{stat}_E[i,0] = \mathrm{stat}_{E'}[i,0] = q_0$ we have $\mathrm{stat}_{E'}[i,1] = \mathrm{stat}_E[i,1]$. Furthermore
by Definition 3.11

$\mathrm{msg}_E(i,1) = \phi_i^m(\mathrm{stat}_E[i,0], a) = \phi_i^m(q_0, a) = \mathrm{msg}_{E'}(i,1).$

For the induction step consider $E[a]$ with more than one event. Let $s = \mathrm{stat}_E[i,j-1]$
and $s' = \mathrm{stat}_{E'}[i,j-1]$. If $j = 1$ (a is the first event at $p_i$) then $s = s' = q_0$.
Otherwise let $b = E[i,j-1]$ and $b' = E'[i,j-1]$ be the events preceding a at $E_i$ and $E'_i$.
Then $b \to a$ and $b' \to a$; hence $b \in E[a]$, $b' \in E'[a]$. Because $E'[a] = E[a]$ we have
$b' = b$ and $E'[b] = E[b] \le E[a]$. By induction hypothesis the state of $p_i$ after b is
the same in $E$ and $E'$; hence again $s = s'$.

If a is an invocation event then

$\mathrm{stat}_E[i,j] = \phi_i^s(s, a) = \phi_i^s(s', a) = \mathrm{stat}_{E'}[i,j],$
$\mathrm{msg}_E(i,j) = \phi_i^m(s, a) = \phi_i^m(s', a) = \mathrm{msg}_{E'}(i,j).$

Otherwise $a = \mathrm{rcv}_E((j,l), i)$ is a receive event. Let $c = \mathrm{inv}_E(j,l)$ and $d = \mathrm{inv}_{E'}(j,l)$
be the corresponding invocation events in $E$ and $E'$. Then $c \to a$ and $d \to a$. It
follows that $c, d \in E[a] = E'[a]$, and therefore $c = d$ and $E'[c] = E[c]$. By induction
hypothesis $\mathrm{msg}_E(j,l) = \mathrm{msg}_{E'}(j,l)$. Therefore

$\mathrm{stat}_E[i,j] = \psi_i^s(s, \mathrm{msg}_E(j,l)) = \psi_i^s(s', \mathrm{msg}_{E'}(j,l)) = \mathrm{stat}_{E'}[i,j],$

and if $j = i$ (a local message was received) then

$\mathrm{val}_E(i,l) = \psi_i^v(s, \mathrm{msg}_E(i,l)) = \psi_i^v(s', \mathrm{msg}_{E'}(i,l)) = \mathrm{val}_{E'}(i,l).$

□

Corollary  4.1 

Let $E' \le E$ be a prefix of the execution history $E$. Then
$\forall a = E'[i,j] \in E'$:

(i) $\mathrm{stat}_{E'}[i,j] = \mathrm{stat}_E[i,j]$

(ii) $\mathrm{msg}_{E'}(i,l) = \mathrm{msg}_E(i,l)$ if $a = \mathrm{inv}_E(i,l)$ is an invocation event
$\mathrm{val}_{E'}(i,l) = \mathrm{val}_E(i,l)$ if $a = \mathrm{rcv}_E((i,l), i)$ is a local receive event

Proof: $E' \le E$ implies $E'[a] = E[a]$ for all $a \in E'$. □

In Section 3.2.2 we defined $H[E, i]$ to be the sequence of formal events observed
by a particular client in an execution E of an implementation Y. The “$\to$” relation
induces a partial order on the formal events in the union of all $H[E, i]$. We call
this partially ordered set of formal events derived from E and Y a run. It is defined
formally below:

Definition 4.3

Given an execution history E and an implementation Y we define the run
$R_Y(E)$ to be the set of formal events given by E, partially ordered by “$\to$”:

$R_Y(E) = \{\mathrm{event}_E(i,j) \mid \text{for all } i, j\}$

with the partial order “$\to$” on $R_Y(E)$ defined as

$\mathrm{event}_E(i,j) \to \mathrm{event}_E(l,m) \Leftrightarrow \mathrm{inv}_E(i,j) \to \mathrm{inv}_E(l,m)$

As in the case of an execution history we use the notation $\mathrm{event}_R(i,j)$ to denote the
j'th event at processor i in a run R.

In [Lam78] Lamport introduced logical timestamps, integers assigned to each
event in such a way that if all events are ordered by their timestamp this order is
consistent with “$\to$”. We can generalize this idea to timestamps which are vectors
of integers [Sch85].2 Such timestamps are useful for keeping track of the partial
order of events as the system executes.

Definition  4.4 

A timestamp $t_e$ for an event $e = \mathrm{event}_R(i,j) \in R$ is a vector of n integers with
the following meaning:

$t_e[k] = \| \{\mathrm{event}_R(k,l) \in R \mid \mathrm{event}_R(k,l) \to e\} \|$

i.e., $t_e[k]$ is the number of events at k that precede e in the partial order.

The following lemma states that given only the timestamps of two events in a run
one can deduce their order under “$\to$”.

Lemma 4.2

Let $a = \mathrm{event}_R(i,j)$ and $b = \mathrm{event}_R(k,l)$ be two events in a run R, and let $t_a$
and $t_b$ be their timestamps. Then

$a \to b \Leftrightarrow t_a[i] \le t_b[i]$

Proof: $a = \mathrm{event}_R(i,j)$ is the j'th event invoked at site i. Therefore $\mathrm{event}_R(i,j') \to a$
iff $j' \le j$, and hence $t_a[i] = j$.


2The  idea  of  vector  timestamps  was  developed  independently  by  Ladin  and 
Liskov  [LL86]. 
If $a \to b$ then by transitivity of “$\to$”, $\mathrm{event}_R(i,j') \to b$ for all $j' \le j$. Hence $t_b[i] \ge j$.

Conversely, let $t_b[i] \ge j$. Then there are at least j events at processor i that
precede b under “$\to$”. In particular, there must be an event $\mathrm{event}_R(i,j') \to b$ for
some $j' \ge j$, which implies that $a \to \mathrm{event}_R(i,j')$, and by transitivity $a \to b$. □

We  will  use  these  timestamps  in  the  implementations  we  construct  in  the  next 
section. 
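
The definition of a vector timestamp and the test from Lemma 4.2 can be checked on a small example. The sketch below is only illustrative: events are written as pairs (site, index) with sites numbered from 1, the partial order is supplied as an explicit set of pairs that is already reflexive and transitive, and the function names are hypothetical.

```python
from itertools import product

def vector_timestamps(events, order, n):
    """t_e[k] = number of events at site k that precede e (Definition 4.4).

    events: list of (site, index) pairs, sites numbered 1..n
    order:  set of (a, b) pairs, assumed reflexive and transitive
    """
    return {
        e: tuple(
            sum(1 for f in events if f[0] == k and (f, e) in order)
            for k in range(1, n + 1)
        )
        for e in events
    }

def lemma_4_2_holds(events, order, ts):
    """Check a -> b  iff  t_a[i] <= t_b[i], where i is the site of a."""
    for a, b in product(events, repeat=2):
        i = a[0] - 1                          # 0-based position of a's site
        if ((a, b) in order) != (ts[a][i] <= ts[b][i]):
            return False
    return True

# Site 1 invokes two events; site 2 invokes one event after seeing the first.
events = [(1, 1), (1, 2), (2, 1)]
order = {((1, 1), (1, 2)), ((1, 1), (2, 1))} | {(e, e) for e in events}
ts = vector_timestamps(events, order, n=2)
```

In this run the events (1, 2) and (2, 1) are concurrent, and the timestamp comparison reports exactly that in both directions.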


4.2 CBCAST Implementation

The causal broadcast protocol described by Birman and Joseph in [BJ87b] is a
protocol that preserves the information flow relation between events, i.e., whenever two
broadcasts $b_1, b_2$ are related under “$\to$” ($b_1 \to b_2$), the protocol guarantees that $b_1$
will be received before $b_2$ everywhere. Concurrent broadcasts may be received in
different orders at different sites. In our formalism we define the ordering properties
of CBCAST by the following axiom:

Definition  4.5 

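CBCAST ordering axiom:

(i) Causal ordering:

$\mathrm{inv}_E(i,j) \to \mathrm{inv}_E(l,m) \Rightarrow \forall k:\ \mathrm{rcv}_E((i,j), k) <_k \mathrm{rcv}_E((l,m), k)$

(ii) Immediate local delivery:

$\forall i, j:\ \neg\exists a:\ \mathrm{inv}_E(i,j) <_i a <_i \mathrm{rcv}_E((i,j), i)$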
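
Part (i) of the axiom can be read as a check on the receive sequence of each site. The following sketch is a minimal illustration under assumed data structures: broadcasts are identified by arbitrary names, the strict causal pairs are given explicitly, and immediate local delivery (part (ii)) is not modeled.

```python
def causally_ordered(deliveries, happens_before):
    """Check causal ordering of receive sequences.

    deliveries:     {site: [broadcast ids in the order they were received]}
    happens_before: set of (b1, b2) pairs with b1 -> b2 and b1 != b2 (assumed input)
    """
    for seq in deliveries.values():
        pos = {b: k for k, b in enumerate(seq)}
        for b1, b2 in happens_before:
            # If b2 was received, b1 must have been received earlier.
            if b2 in pos and (b1 not in pos or pos[b1] > pos[b2]):
                return False
    return True

hb = {("b1", "b2")}
ok = causally_ordered({1: ["b1", "b2", "b3"], 2: ["b3", "b1", "b2"]}, hb)
bad = causally_ordered({1: ["b2", "b1"]}, hb)
```

The concurrent broadcast b3 may appear at different positions at the two sites without violating the check, which mirrors the freedom CBCAST leaves for concurrent broadcasts.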

How can we use such a broadcast to construct an implementation for a given
specification $\mathcal{S}$? Our plan is to take the ABCAST implementation from the previous
chapter, replace the ABCAST by a CBCAST, and determine under which condition
the implementation will still be correct. In order for this to work it is necessary to
make two more modifications to the ABCAST implementation:

• Recall that the correctness of the ABCAST implementation depended on the
fact that all processors agreed on the order of events and therefore construct
the same legal history H. If we use a CBCAST instead of ABCAST this will no
longer be the case. However, using the timestamps introduced in the previous
section it is possible for all processors to keep track of and agree upon the
partial order of events during the execution of the system. In other words, we
replace the variable H in Figure 3.4 by a variable R, containing a partially
ordered set of events (a run).

• Now, in order to execute an operation correctly it is necessary to relate these
runs to globally ordered histories as they appear in a formal specification.
For this purpose we introduce a function that maps partially ordered sets of
events to totally ordered histories. We call this a linearization operator; it is
defined formally below.

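Definition 4.6

A linearization operator, $\mathrm{LIN} : \mathcal{R} \to \mathcal{H} \cup \{\bot\}$, is a partial function^a from runs
to histories, such that:

(i) $\mathrm{LIN}(\emptyset) = \emptyset$

(ii) $\forall R$: If $H = \mathrm{LIN}(R) \ne \bot$ then

$\forall a:\ a \in H \Leftrightarrow a \in R$
$\forall a, b \in R:\ a \to b \Rightarrow a <_H b$

^a The symbol $\bot$ denotes an undefined return value, i.e., $\mathrm{LIN}(R) = \bot$ means LIN
is undefined on R.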
Processor i runs the following program:

R := empty;
t := (0,0,...,0);
t[i] := 1;
loop
    wait for an invocation by the local client or the receipt of a broadcast;
    if client invoked operation a then
        pick a value v, such that LIN(R + a:v) ∈ S
        CBCAST (a:v, t) to all processors;
        return v to the client;
    else if broadcast (a:v, t') was received from j then
        R := R ⊕_t' a:v;
        t[j] := t[j] + 1;
    end if
end loop

Figure  4.1:  CBCAST  implementation 

For every run R on which LIN(R) is defined, LIN linearizes the events in R in a
way that preserves the partial order “$\to$” in R. With these changes, our CBCAST
implementation, in informal pseudocode, looks like Figure 4.1.
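
One simple way to obtain an operator with the two properties of a linearization operator is a deterministic topological sort. The sketch below is an assumption made for illustration, not the operator constructed later in this chapter: ties between concurrent events are broken by the natural ordering of the event names, and None plays the role of the undefined value. Such an operator is not necessarily constructive for a given specification.

```python
import heapq

def lin(events, strict_order):
    """Linearize a partially ordered set of events.

    events:       iterable of hashable, mutually comparable event names
    strict_order: set of (a, b) pairs with a != b meaning a -> b
    Returns a list that preserves the order, or None if the input has a cycle.
    """
    events = list(events)
    indeg = {e: 0 for e in events}
    succ = {e: [] for e in events}
    for a, b in strict_order:
        succ[a].append(b)
        indeg[b] += 1
    ready = [e for e in events if indeg[e] == 0]
    heapq.heapify(ready)                      # smallest ready event first
    out = []
    while ready:
        e = heapq.heappop(ready)
        out.append(e)
        for f in succ[e]:
            indeg[f] -= 1
            if indeg[f] == 0:
                heapq.heappush(ready, f)
    return out if len(out) == len(events) else None

history = lin(["a", "b", "c"], {("c", "a")})
```

The result contains exactly the events of the run and places c before a, as required; the position of the concurrent event b is fixed only by the tie-breaking rule.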

We  need  to  answer  two  questions. 

1. For which kind of specifications will this CBCAST implementation be correct?
We answer this question in Section 4.2.1.

2. How general is this implementation? Perhaps there are other methods of
constructing a CBCAST implementation that cover a larger class of specifications.
We address this question in Section 4.2.2.

In  order  to  answer  these  questions  we  need  to  translate  the  CBCAST  implementation 
from  Figure  4.1  into  our  execution  model. 

Definition  4.7 

We  use  the  following  notation: 

$R + e$: the run R with the event e added at the end, i.e., ordered after all
other events in R ($\forall e' \in R + e:\ e' \to e$).

$R' \le R$: $R'$ is a prefix of R:
$R' \subseteq R \wedge \forall e, e' \in R:\ (e \to e' \wedge e' \in R') \Rightarrow e \in R'$.

$R[e]$: the run consisting of all events preceding e in R, i.e.,
$R[e] = \{e' \in R \mid e' \to e\}$ (see footnote 3).

$R \oplus_t e$: the run R with the event e added and ordered according to its
timestamp $t_e$, i.e., $\mathrm{event}_R(i,k) \to e \Leftrightarrow t_e[i] \ge k$.

3 Recall that we defined “$\to$” to be reflexive. Therefore the event e is always
contained in $R[e]$.

The implementation outlined in Figure 4.1 contains a construct

“pick a value v, such that $\mathrm{LIN}(R + a{:}v) \in S$”

similar to the one used in Figure 3.4 (ABCAST implementation). In the ABCAST
case we used the fact that specifications are complete to argue that such a return
value will always exist. In the CBCAST implementation the argument no longer
holds, because, as defined in Definition 4.6, LIN may reorder events in $R + a{:}v$
differently for different values of v. Therefore we must place the following restriction
on LIN:

Definition  4.8 

A linearization operator LIN is constructive under $\mathcal{S} = (n, I, V, S)$ iff
$\forall$ runs R such that $\mathrm{LIN}(R) \in S$: $\forall$ invocations $a \in I$:

$\exists$ a return value $v \in V$: $\mathrm{LIN}(R + a{:}v) \in S$.

If LIN is constructive under $\mathcal{S}$ we can define an execution function
$X_{S,LIN} : \mathcal{R} \times I \to V$ (similar to the one used in Section 3.5) with the following property:

$\forall R$ such that $\mathrm{LIN}(R) \in S$: $\forall a \in I$:
$v = X_{S,LIN}(R, a) \Rightarrow \mathrm{LIN}(R + a{:}v) \in S$

Given a specification $\mathcal{S} = (n, I, V, S)$ and a constructive linearization operator LIN,
we define the implementation $Y_{S,LIN}$ as follows:

$Y_{S,LIN} = (n, I, V, I \times T, (I \times V)^* \times T, (\emptyset, [0, \ldots, 1, \ldots, 0]), \Phi, \Psi)$,
i.e., $M = I \times T$, $Q = (I \times V)^* \times T$, $q_0 = (\emptyset, [0, \ldots, 1, \ldots, 0])$,

where $A^*$ denotes the set of all partially ordered subsets of A, i.e., $(I \times V)^*$ is the
set of all runs that can be constructed from events in $I \times V$. $T = \mathbb{N}^n$ is the set
of all timestamps (integer-valued vectors of length n). The transition functions $\phi_i$
and $\psi_i$ are defined as follows. When operation a is invoked at processor i in state
$[R, t]$, it does not change its state but broadcasts $[a{:}v, t, i]$:

$\phi_i([R, t], a) = ([R, t], [a{:}v, t, i])$, where $v = X_{S,LIN}(R, a)$.

When processor i receives a message $[a{:}v, s, j]$, it adds the event a:v to its run R
and updates its timestamp vector by incrementing the j'th component; the value v
is returned to the client:

$\psi_i([R, t], [a{:}v, s, j]) = ([R \oplus_s a{:}v,\ t'],\ v)$,
where $t'[j] = t[j] + 1$, $t'[k] = t[k]$ for $k \ne j$.
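
The two transition functions can be mirrored by a small class. The sketch below only illustrates the bookkeeping of run and timestamp; it makes several assumptions that are not part of the text: sites are numbered from 0, operations are tuples such as ("inc", 5) or ("read",), the run is stored as a dictionary keyed by (site, sequence number), and counter_exec is a hypothetical stand-in for the execution function that simply sums the increments known locally.

```python
def counter_exec(run, op):
    """Hypothetical execution function: a read returns the sum of known increments."""
    if op[0] == "read":
        return sum(o[1] for o, _, _ in run.values() if o[0] == "inc")
    return None                                # increments return no value

class Processor:
    def __init__(self, i, n, exec_fn):
        self.i, self.exec_fn = i, exec_fn
        self.run = {}                          # (site, seq) -> (op, value, timestamp)
        self.t = [0] * n
        self.t[i] = 1                          # initial state has a 1 in position i

    def invoke(self, op):
        """Invocation transition: state unchanged, message [a:v, t, i] produced."""
        v = self.exec_fn(self.run, op)
        return (op, v, tuple(self.t), self.i)

    def receive(self, msg):
        """Receive transition: add the event, increment component j of t."""
        op, v, s, j = msg
        self.run[(j, s[j])] = (op, v, s)       # s[j] is the sender's own event count
        self.t[j] += 1
        return v if j == self.i else None      # value goes to the local client only

p0, p1 = Processor(0, 2, counter_exec), Processor(1, 2, counter_exec)
m = p0.invoke(("inc", 5))
p0.receive(m)                                  # immediate local delivery
p1.receive(m)
r = p1.invoke(("read",))
```

After these steps the read issued at the second processor carries the timestamp (1, 1): one event from site 0 precedes it, and it is the first event at site 1.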

4.2.1 Correctness of CBCAST Implementation

The implementation we defined above has the property that every run $R_Y(E)$
generated by one of its executions satisfies a property that we call local correctness:

Definition 4.9

A run R is locally correct under LIN and $\mathcal{S}$, iff
$\forall$ events $e \in R$: $\mathrm{LIN}(R[e]) \in S$

This property can be interpreted as follows. Say e = a:v is an event invoked at
processor i. In a run $R = R_{Y_{S,LIN}}(E)$ of a CBCAST execution, the subrun $R[e]$
contains exactly those events that processor i knows about at the time a is invoked.
This is because $e' \in R[e]$ implies $e' \to e$; hence the CBCAST ordering guarantees that
the message about $e'$ is received before a is invoked. $\mathrm{LIN}(R[e]) \in S$ then means
that processor i executes the operation a in a way that is correct with respect to
its local knowledge.

As stated in the next theorem, our CBCAST implementation $Y_{S,LIN}$ will be
correct if the specification $\mathcal{S}$ has the property that local correctness always implies
global correctness (i.e., $\mathrm{LIN}(R) \in S$).
[Figure 4.2: a run R containing INC and READ events, among them $Inc_1(5)$ and $Read_1{:}6$]


Theorem  4.1 

If LIN is constructive and satisfies

$\forall$ runs R: R locally correct $\Rightarrow \mathrm{LIN}(R) \in S$,
then $Y_{S,LIN}$ is a correct CBCAST implementation of specification $\mathcal{S}$.

Before we present the proof it is useful to give an example for a specification that
does not satisfy this condition. Consider the problem of implementing a simple
counter. Clients can increment the counter by a specified amount (INC(x)) and
read the current value of the counter (READ). It is straightforward to write down
a specification for such a counter: in a legal history H every READ must return the
sum of all increment values of INC operations preceding the READ in H. Figure 4.2
gives an example of a run R containing INC and READ events (events are represented
by circles, the partial order by the arrows in the figure). This run satisfies local
correctness. Consider for example

$R[Read_2{:}9] = (Inc_2(1) \to Inc_1(5) \parallel Inc_3(3) \to Read_2{:}9)$

There are two ways of linearizing this subrun:

$\mathrm{LIN}(R[Read_2{:}9]) = (Inc_2(1), Inc_1(5), Inc_3(3), Read_2{:}9)$, or
$\mathrm{LIN}(R[Read_2{:}9]) = (Inc_2(1), Inc_3(3), Inc_1(5), Read_2{:}9)$,

which are both legal histories. Similarly one can check that for every event e in
this run, any linearization of $R[e]$ is legal. Therefore R is locally correct no matter
how LIN is defined. But R itself cannot be linearized into a legal history: If LIN
orders $Inc_1(5)$ before $Inc_3(3)$ then the event $Read_3{:}4$ will be illegal; if $Inc_3(3)$ is
ordered before $Inc_1(5)$ then $Read_1{:}6$ is illegal. Hence although R is locally correct
it does not satisfy $\mathrm{LIN}(R) \in S$ (global correctness).
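
The counter example can be checked mechanically by enumerating linearizations. The sketch below is an illustration under assumptions: the figure itself is not available here, so the set of events and the edges are reconstructed from the surrounding discussion (three increments, three reads returning 9, 4 and 6), and the assignment of site subscripts to the reads is a guess. All function names are hypothetical.

```python
from itertools import permutations

# Events are (name, kind, value). The edges below are an assumed reconstruction.
I2, I1, I3 = ("Inc2", "inc", 1), ("Inc1", "inc", 5), ("Inc3", "inc", 3)
R2, R3, R1 = ("Read2", "read", 9), ("Read3", "read", 4), ("Read1", "read", 6)
EVENTS = [I2, I1, I3, R2, R3, R1]
EDGES = [(I2, I1), (I2, I3), (I1, R2), (I3, R2), (I1, R1), (I3, R3)]

def past(e, edges):
    """R[e]: all events that precede e, including e itself."""
    seen, stack = {e}, [e]
    while stack:
        x = stack.pop()
        for a, b in edges:
            if b == x and a not in seen:
                seen.add(a)
                stack.append(a)
    return seen

def linearizations(events, edges):
    """Yield every total order of `events` that respects `edges`."""
    for perm in permutations(list(events)):
        pos = {e: k for k, e in enumerate(perm)}
        if all(pos[a] < pos[b] for a, b in edges if a in pos and b in pos):
            yield perm

def legal(history):
    """Counter specification: every read returns the sum of earlier increments."""
    total = 0
    for _, kind, value in history:
        if kind == "inc":
            total += value
        elif value != total:
            return False
    return True

locally_correct = all(
    all(legal(h) for h in linearizations(past(e, EDGES), EDGES)) for e in EVENTS
)
globally_linearizable = any(legal(h) for h in linearizations(EVENTS, EDGES))
```

Under these assumptions every linearization of every subrun is legal, while no linearization of the whole run is, which matches the argument in the text.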

We  will  give  the  proof  for  Theorem  4.1  on  page  64  after  the  following  three 
lemmas. 

Lemma  4.3 

(i) $E' \le E \Rightarrow R_Y(E') \le R_Y(E)$

(ii) Let $e = a{:}v \in R_Y(E)$. Then
$R_Y(E)[e] = R_Y(E[a])$.

Proof: Follows from Definition 4.2 and Lemma 4.1. □

The next lemma makes a statement about the state of a processor in $Y_{S,LIN}$ at the
time when a processor completes a client request and returns a value to the client.

Lemma  4.4 

Let E be a CBCAST execution history and $R = R_{Y_{S,LIN}}(E)$. Let $e = a{:}v =
\mathrm{event}_E(i,j)$ be an event in R and let $E[i,l] = \mathrm{rcv}_E((i,j), i)$ be the corresponding
receive event in E. Then

$\mathrm{stat}_E[i,l] = [R[e], \mathrm{timestamp}(e)]$.

In other words, at the time a value is returned to client i the state of $p_i$ correctly
records the timestamp of the event e as well as the run $R[e]$ of all events preceding
e under “$\to$”.

Proof: Let $\mathrm{stat}_E[i,l] = [r, t]$.

(i) We will first show that r and $R[e]$ contain the same set of events. Let $e' =
\mathrm{event}_E(i',j') \in r$. Then $e'$ was added to r when $p_i$ received a message $m = [e', t', i']$
from processor $i'$. Therefore

$\mathrm{rcv}_E((i',j'), i) <_i E[i,l] = \mathrm{rcv}_E((i,j), i)$

Because of immediate local delivery (CBCAST axiom, Definition 4.5) this implies

$\mathrm{rcv}_E((i',j'), i) <_i \mathrm{inv}_E(i,j)$

Therefore $\mathrm{inv}_E(i',j') \to \mathrm{inv}_E(i,j)$. By Definition 4.3 $e' \to e$ in R; hence $e' \in R[e]$.
This shows $r \subseteq R[e]$.

Conversely consider $e' = \mathrm{event}_E(i',j') \in R[e]$; then $e' \to e$ in R. By Definition 4.3
$\mathrm{inv}_E(i',j') \to \mathrm{inv}_E(i,j)$. Because of causal ordering under the CBCAST axiom

$\mathrm{rcv}_E((i',j'), i) <_i \mathrm{rcv}_E((i,j), i)$.

In other words, $p_i$ receives the message $m = [e', t', i']$ from processor $i'$ before
$E[i,l] = \mathrm{rcv}_E((i,j), i)$. Hence the event $e'$ will have been added to r by that time,
i.e., $e' \in r$. This shows that $R[e] \subseteq r$. We conclude that (as unordered sets of
events) $r = R[e]$.

(ii) Next, we show $t = \mathrm{timestamp}(e)$. The vector component $t[j]$ is incremented
each time $p_i$ receives a message from $p_j$. Therefore

$t[j] = \| \{e' \in r \mid e' \text{ is an event invoked at } p_j\} \|$

Part (i) of the proof implies

$\{e' \in r \mid e' \text{ is an event invoked at } p_j\} = \{e' \in R[e] \mid e' \text{ is an event invoked at } p_j\}$

Therefore by Definition 4.4, $t = \mathrm{timestamp}(e)$.

(iii) Finally, we show that the partial orders in r and $R[e]$ are identical. This follows
immediately from (i) and (ii), because in r events are ordered by timestamp. □

Lemma  4.5 

If LIN is constructive and satisfies

$\forall$ runs R: R locally correct $\Rightarrow \mathrm{LIN}(R) \in S$,
then for every CBCAST execution history E: $R_{Y_{S,LIN}}(E)$ is locally correct.

Proof: Let $Y = Y_{S,LIN}$. We proceed by induction on the number of events in E.
The base case, $E = \emptyset$, is trivially satisfied, because an empty run is always locally
correct.

Induction step: consider $E \ne \emptyset$. We have to show that $\mathrm{LIN}(R_Y(E)[e]) \in S$ for
every e in $R_Y(E)$. Let $a = \mathrm{inv}_E(i,j)$ be an invocation event in E, and let $e = a{:}v =
\mathrm{event}_E(i,j)$.

Define $E' = E[a]$, and $E'' = E' - a$. By Lemma 4.3 $R[e] = R_Y(E') = R_Y(E'') + a{:}v$.
By induction hypothesis $R_Y(E'')$ is locally correct and therefore by assumption
$\mathrm{LIN}(R_Y(E'')) \in S$.

By Lemma 4.4 $R_Y(E'')$ is equal to the “R-part” of the state $\mathrm{stat}_E[i,j]$ of processor
i at the invocation event a. The message it sends is determined by $\phi_i$:

$m = \phi_i^m(R_Y(E''), a) = [a{:}v', t, i]$, where $v' = X_{S,LIN}(R_Y(E''), a)$.

Therefore $v = v' = X_{S,LIN}(R_Y(E''), a)$. From the definition of $X_{S,LIN}$ it follows that
$\mathrm{LIN}(R[e]) = \mathrm{LIN}(R_Y(E'') + a{:}v) \in S$. □

We  now  have  all  the  necessary  tools  to  prove  the  theorem  about  the  correctness  of 
the  CBCAST  implementation. 

Proof of Theorem 4.1: We show that under the assumption of Theorem 4.1
(local correctness of R implies $\mathrm{LIN}(R) \in S$), for every CBCAST execution history
E, the history $H = \mathrm{LIN}(R_Y(E)) \in S$ and satisfies the correctness and liveness
conditions of Definition 3.12.

(i) By Lemma 4.5 $R_Y(E)$ is locally correct and therefore by assumption

$H = \mathrm{LIN}(R_Y(E)) \in S$.

(ii) Correctness: We have to show that $H|_i = H[E, i]$ for all i.

$H|_i$ contains the same set of events as $H[E, i]$, because $H = \mathrm{LIN}(R_Y(E))$.
Furthermore, the order $<_i$ on $E_i$ is preserved in H, because $e <_i e'$ implies $e \to e'$, and
LIN preserves “$\to$”.

(iii) Liveness: Let $\mathrm{rcv}_E((l,m), k) <_k \mathrm{inv}_E(l',m')$.

We have to show that $\mathrm{event}_E(l,m) < \mathrm{event}_E(l',m')$ in H. This follows from
the fact that LIN preserves “$\to$”, because $\mathrm{rcv}_E((l,m), k) <_k \mathrm{inv}_E(l',m')$ implies
$\mathrm{event}_E(l,m) \to \mathrm{event}_E(l',m')$. □

4.2.2 Existence of CBCAST Implementation

Theorem 4.1 in the previous section gave a sufficient condition for a specification $\mathcal{S}$
to be implementable with CBCAST. In this section we will show that this condition
is not only sufficient but also necessary. This will show that our CBCAST
implementation really is the most general implementation based on CBCAST: every problem
that has a CBCAST solution is solvable with our implementation. In other words,
the goal of this section is to prove the following theorem.

Theorem 4.2

A specification $\mathcal{S}$ has a CBCAST implementation
$\Leftrightarrow$
$\exists$ constructive linearization operator LIN:
$\forall$ runs R: R locally correct $\Rightarrow \mathrm{LIN}(R) \in S$.

The “if” (“$\Leftarrow$”) direction is equivalent to Theorem 4.1 which we proved in the
previous section. So, our task is the following: Given some CBCAST implementation
Y of $\mathcal{S}$, we have to derive a linearization operator that satisfies the conditions of
Theorem 4.2. We will do this as follows: Given a run R our linearization operator
will map this run to a history in two steps:

$R \to E \to H$

In the first step, we map a run R to an execution history E that has the same set of
invocations as R and the same partial order as R. The second mapping is defined
in terms of the behavior of implementation Y. If Y is correct then there must exist
a legal history H that satisfies the correctness and liveness conditions with respect
to E (Definition 3.12). We map E to this history.

The first mapping is called $\Gamma$. We want to define this mapping in such a way
that it preserves the partial order “$\to$” on R. Given R we get a partially ordered
set of invocation events simply by ignoring the return values of the formal events in
R. The execution history $E = \Gamma(R)$ will have exactly these invocation events plus
all the corresponding receive events. Notice that in a CBCAST execution history
the “$\to$” relation between invocation events already determines the order of receive
events relative to invocation events in each processor history $E_i$:

$\mathrm{rcv}_E((k,l), i) <_i \mathrm{inv}_E(i,j) \Leftrightarrow \mathrm{inv}_E(k,l) \to \mathrm{inv}_E(i,j)$

The “$\Rightarrow$” direction follows from the definition of “$\to$”, the other direction from the
CBCAST ordering axiom (Definition 4.5). Therefore, to completely determine $E =
\Gamma(R)$ we only need to specify the order of the receive events between two invocation
events. The CBCAST ordering axiom already determines a partial order on these
events; to define $E = \Gamma(R)$ we can pick any linearization of this partial order. A
topological sort will suffice. The next definition summarizes this procedure.

Definition 4.10

The function $\Gamma : \mathcal{R} \to \mathcal{E}$ maps a run R to an execution history E in the
following way:

(i) $E_i = \{\, a \mid \exists j, v : a{:}v = \mathrm{event}_R(i,j) \in R \,\} \;\cup\; \{\, (k,l) \mid \exists k, l : \mathrm{event}_R(k,l) \in R \,\}$

(ii) $<_i$ is the topological sort of the partial order on the events
in $E_i$ induced by “$\to$” on R.
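The topological sort used in Definition 4.10(ii) can be sketched in a few lines. The following is a minimal illustration, not part of the formal model: the event labels and the helper name `linearize` are hypothetical, and the partial order is passed in explicitly as a set of pairs rather than derived from a run.

```python
from graphlib import TopologicalSorter

def linearize(events, precedes):
    """Return one total order of `events` that respects every (a, b) in `precedes`."""
    sorter = TopologicalSorter({e: set() for e in events})
    for a, b in precedes:
        sorter.add(b, a)  # b depends on a, so a is emitted before b
    return list(sorter.static_order())

# Hypothetical processor history: two receive events fall between two
# invocation events; their relative order is left open by the partial order.
events = ["inv(1,1)", "rcv((2,1),1)", "rcv((3,1),1)", "inv(1,2)"]
precedes = {
    ("inv(1,1)", "rcv((2,1),1)"),
    ("inv(1,1)", "rcv((3,1),1)"),
    ("rcv((2,1),1)", "inv(1,2)"),
    ("rcv((3,1),1)", "inv(1,2)"),
}
order = linearize(events, precedes)
```

Any order returned this way is acceptable for the construction, since the definition only requires some linearization of the partial order.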


The  next  lemma  formally  states  the  properties  of  this  mapping. 

Lemma 4.6

Let $E = \Gamma(R)$. Then

(i) E is a CBCAST execution history.

(ii) $\mathrm{inv}_E(i,j) \to \mathrm{inv}_E(i',j')$ in E $\;\Leftrightarrow\;$ $\mathrm{event}_R(i,j) \to \mathrm{event}_R(i',j')$ in R.

(iii) $R' < R \;\Rightarrow\; \Gamma(R') < \Gamma(R)$.

Proof: (i) and (ii): As discussed above, we constructed $\Gamma$ in such a way as to
satisfy these two properties.

(iii) Follows from property (i) and Definition 4.10(ii). □

We now use $\Gamma$ to define a linearization operator LIN derived from an implementation
of $\mathcal{S}$.

Definition 4.11

Let $Y = (n, I, V, M, Q, q_0, \phi, \psi)$ be a CBCAST implementation of $\mathcal{S}$.

$\mathcal{H}(R) = \{\, H \in S \mid H \text{ is a linearization of } R_Y(\Gamma(R)) \,\}$

$LIN_Y(R) = \begin{cases} \text{“smallest” } H \text{ in } \mathcal{H}(R) & \text{if } R = R_Y(\Gamma(R)) \\ \bot & \text{otherwise} \end{cases}$

Notice that this definition implicitly assumes that $\mathcal{H}(R)$ is non-empty whenever
$R = R_Y(\Gamma(R))$.

Lemma 4.7

If Y is a correct implementation of $\mathcal{S}$ then $R = R_Y(\Gamma(R))$ implies $\mathcal{H}(R) \neq \emptyset$,
hence $LIN_Y$ is well defined.

Proof: Let $E = \Gamma(R)$. If Y is correct then there exists a history $H \in S$ that satisfies
the correctness and liveness condition of Definition 3.12. Correctness and liveness
imply that H is a linearization of $R_Y(E)$. Therefore $H \in \mathcal{H}(R)$ if $R = R_Y(E)$. □

We now proceed to show that if Y is a correct CBCAST implementation of $\mathcal{S}$ then
$LIN_Y$ indeed satisfies the conditions of Theorem 4.2.

Lemma 4.8

$LIN_Y$ is constructive.

Proof: Let R be a run such that $LIN_Y(R) \in S$. We have to show that for every
invocation $a \in I$ there exists a return value v such that $LIN_Y(R + a{:}v) \in S$.

Let $E = \Gamma(R)$. If $LIN_Y(R) \neq \bot$, Definition 4.11 implies $R = R_Y(E)$. Consider an
execution $E'$ that is identical to E, except that it has one more invocation event a
at the end, i.e.,

$E'_i = E_i + a + (i, j+1)$, where j is the number of invocation events in $E_i$,
$E'_k = E_k + (i, j+1)$ for $k \neq i$.

Let $R' = R_Y(E')$. Then $R' = R + a{:}v$, where $v = \mathrm{val}_{E'}(i, j+1)$. By construction
$E' = \Gamma(R')$. Therefore, by Definition 4.11, $LIN_Y(R + a{:}v) = LIN_Y(R') \in S$. □

Lemma 4.9

$\forall$ runs $R$: $R$ locally correct $\Rightarrow LIN_Y(R) \in S$.

Proof: Induction on the number of events in R. The base case, $R = \emptyset$, is trivially
satisfied, because an empty history is always legal.

For the induction step consider $R \neq \emptyset$. We have to show that $LIN_Y(R) \in S$,
which, by Definition 4.11, is equivalent to $R = R_Y(E)$, where $E = \Gamma(R)$. Let
$a{:}v = \mathrm{event}_R(i,j) \in R$ be an event in R which corresponds to the invocation event
$a = \mathrm{inv}_E(i,j)$ in E. To prove that $R = R_Y(E)$ we have to show that $\mathrm{val}_E(i,j) = v$.

Let e be a maximal event in R, and define $R' = R[e]$ and $R'' = R - \{e\}$. Local
correctness of R implies that $LIN_Y(R') \in S$ and that $R''$ is locally correct. By
induction hypothesis $LIN_Y(R'') \in S$. Therefore, by Definition 4.11, $R' = R_Y(E')$
and $R'' = R_Y(E'')$, where $E' = \Gamma(R')$ and $E'' = \Gamma(R'')$.

Case 1, $a{:}v = e$: In this case $a{:}v \in R'$. Because $R' < R$ we have $E' < E$ (Lemma 4.6).
By Lemma 4.1, $\mathrm{val}_E(i,j) = \mathrm{val}_{E'}(i,j) = v$.

Case 2, $a{:}v \neq e$: In this case $a{:}v \in R''$. Because $R'' < R$ we have $E'' < E$
(Lemma 4.6). Again, by Lemma 4.1, $\mathrm{val}_E(i,j) = \mathrm{val}_{E''}(i,j) = v$. □

Proof of Theorem 4.2: The “$\Leftarrow$” direction is equivalent to Theorem 4.1. The
“$\Rightarrow$” direction follows from Lemma 4.8 and Lemma 4.9. □

4.3  Bcast and Fbcast Implementation

In this section we consider the problem of constructing implementations based on
unordered or FIFO broadcasts (BCAST and FBCAST). We start by investigating
BCAST implementations.

The CBCAST protocol presented in [BJ87b] implements causal ordering on top
of unordered message channels by a method called “piggybacking”. Every broadcast
message is augmented by previous messages it might depend on before the message
is sent out. This way causal ordering can be achieved without multiple phases of
message exchanges. We use this idea to translate our CBCAST implementation from
Figure 4.1 into an equivalent BCAST implementation of the same specification. As
in the CBCAST implementation, every processor keeps track of all events and their
partial order. When client i invokes an operation a, processor i not only broadcasts
the event e = a:v, but also the whole set of events that precede e under “$\to$”. An
informal description of the BCAST implementation is given in Figure 4.3. In the
rest of this section we will translate this implementation into our formal execution
model and show that it is correct under exactly the same conditions for $\mathcal{S}$ and LIN
as the CBCAST implementation.


Processor i runs the following program:

R := empty;
loop
    wait for an invocation by the local client or the receipt of a broadcast;
    if client invoked operation a then
        pick a value v, such that LIN(R + a:v) ∈ S;
        R := R + a:v;
        bcast(R) to all processors;
        return v to the client;
    else if broadcast(R') was received then
        R := R ∪ R';
    end if
end loop

Figure 4.3: BCAST implementation
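The piggybacking loop of Figure 4.3 can be illustrated with a small simulation. This is only a sketch under simplifying assumptions: a run is stored as a dictionary from each event to the set of events known to precede it, the class name `Processor` and the callback `pick_value` are hypothetical, and `pick_value` merely stands in for the step "pick a value v such that LIN(R + a:v) ∈ S".

```python
class Processor:
    """One processor of the sketch; `pick_value` stands in for the LIN-based choice of v."""

    def __init__(self, pid, pick_value):
        self.pid = pid
        self.pick_value = pick_value
        self.run = {}      # event -> frozenset of events known to precede it
        self.count = 0     # number of local invocations so far

    def invoke(self, op):
        v = self.pick_value(self.run, op)
        self.count += 1
        event = (self.pid, self.count, op, v)
        self.run[event] = frozenset(self.run)  # all known events precede the new one
        return v, dict(self.run)               # value for the client, message bcast(R)

    def receive(self, message):
        self.run.update(message)               # R := R ∪ R'


# Hypothetical specification in which every operation simply returns "ok".
always_ok = lambda run, op: "ok"
p1, p2, p3 = (Processor(i, always_ok) for i in (1, 2, 3))

_, m1 = p1.invoke("add(x)")
p2.receive(m1)
_, m2 = p2.invoke("add(y)")
p3.receive(m2)             # m1 itself has not reached p3 yet

e1 = (1, 1, "add(x)", "ok")
e2 = (2, 1, "add(y)", "ok")
```

Although processor 3 never received the first broadcast directly, its run already contains `e1` as a predecessor of `e2`, because the second message carried the sender's whole run. A late arrival of `m1` changes nothing.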
Given a specification $\mathcal{S} = (n, I, V, S)$ and a constructive linearization operator
LIN, we define the implementation $Y_{S,LIN}$ as follows:

$Y_{S,LIN} = (n, I, V, (I \times V)^\# \times V, (I \times V)^\#, \emptyset, \phi, \psi)$,
i.e., $M = (I \times V)^\# \times V$, $Q = (I \times V)^\#$, $q_0 = \emptyset$,

where $(I \times V)^\#$ denotes the set of all runs that can be constructed from events in
$I \times V$. The transition functions $\phi_i$ and $\psi_i$ are defined as follows. When operation a
is invoked at processor i in state R, it adds the event a:v to its run R, and broadcasts
$[R + a{:}v, v]$:

$\phi_i(R, a) = (R + a{:}v, [R + a{:}v, v])$, where $v = X_{S,LIN}(R, a)$

When processor i receives a message $[R', v]$, it adds all events in $R'$ to its run R,
and the value v is returned to the client:

The next lemma makes a statement about the state of a processor in $Y_{S,LIN}$
at the time when a processor completes a client request (i.e., returns a value to the
client).

Lemma 4.10

Let E be a BCAST execution history and $R = R_{Y_{S,LIN}}(E)$. Let $e = a{:}v =
\mathrm{event}_E(i,j)$ be an event in R and let $a = \mathrm{inv}_E(i,j)$ be the corresponding
invocation event in E. Then

$\mathrm{state}_E[i,j] = R[e]$

Proof: Let $\mathrm{state}_E[i,j] = r$. Proof by induction on the number of events in $R[e]$.
Base case: $R[e] = \{e\}$. Then $a = \mathrm{inv}_E(i,j)$ is the first invocation event in $E_i$.
Furthermore, there can be no receive events preceding a in $E_i$, otherwise the corresponding
formal events would be in $R[e]$. Therefore the state of $p_i$ before the event
e is $q_0 = \emptyset$. Hence $r = \emptyset + e = \{e\}$.

Induction step: consider r with more than one event. Let $f \in r$. Then f was
added to r when $p_i$ received a message $m = r'$ with $f \in r'$ from processor $i'$. Let
$e' = \mathrm{event}_E(i',j')$ be the invocation that caused $p_{i'}$ to send this message. Then
$e' \to e$, because $\mathrm{rcv}_E((i',j'),i) <_i \mathrm{inv}_E(i,j)$. Furthermore, by induction hypothesis
$r' = R[e']$; hence $f \to e'$. By transitivity $f \to e$. Therefore $f \in R[e]$. This shows
$r \subseteq R[e]$.

Let $e' = \mathrm{event}_E(i',j') \in R[e]$, i.e., $e' \to e$. Then $\mathrm{inv}_E(i',j') \to \mathrm{inv}_E(i,j)$. If $i' = i$
then $e'$ was added to r at the invocation $\mathrm{inv}_E(i,j')$; hence $e' \in r$. Otherwise, by the
definition of “$\to$” (4.1) there must be a receive event $\mathrm{rcv}_E((i'',j''),i) \in E_i$ such that

$\mathrm{inv}_E(i',j') \to \mathrm{inv}_E(i'',j'') \to \mathrm{rcv}_E((i'',j''),i) <_i \mathrm{inv}_E(i,j)$.

Then $e' \to e'' = \mathrm{event}_E(i'',j'')$. By induction hypothesis the state $r''$ of $p_{i''}$ after
the invocation event $\mathrm{inv}_E(i'',j'')$ is equal to $R[e'']$. Therefore $e' \in r'' = \mathrm{msg}_E(i'',j'')$.
Therefore, when $p_i$ receives the message from $p_{i''}$ it contains $e'$, which will then be
added to r. This shows $R[e] \subseteq r$. We conclude that $r = R[e]$. □

This lemma is the equivalent of Lemma 4.4 for CBCAST implementations.

Theorem 4.3

If LIN is constructive and satisfies

$\forall$ runs $R$: $R$ locally correct $\Rightarrow LIN(R) \in S$,

then $Y_{S,LIN}$ is a correct BCAST implementation of specification $\mathcal{S}$.

Proof: The proof is the same as for Theorem 4.1 with references to Lemma 4.4
replaced by Lemma 4.10. □

Theorem 4.4

A specification $\mathcal{S}$ has a BCAST implementation

$\Leftrightarrow$

$\exists$ constructive linearization operator $LIN$:

$\forall$ runs $R$: $R$ locally correct $\Rightarrow LIN(R) \in S$.

Proof: The “$\Leftarrow$” direction is equivalent to the previous theorem (4.3). The “$\Rightarrow$”
direction follows from Theorem 4.2, because every BCAST implementation for $\mathcal{S}$ is
also a CBCAST implementation. □

Corollary 4.2

A specification $\mathcal{S}$ has an FBCAST implementation

$\Leftrightarrow$

$\exists$ constructive linearization operator $LIN$:

$\forall$ runs $R$: $R$ locally correct $\Rightarrow LIN(R) \in S$.

Proof: The “$\Leftarrow$” direction follows from Theorem 4.4, because every BCAST
implementation for $\mathcal{S}$ is also an FBCAST implementation. The “$\Rightarrow$” direction follows
from Theorem 4.2, because every FBCAST implementation for $\mathcal{S}$ is also a CBCAST
implementation. □


4.4  Summary 


In this chapter we looked at the problem of constructing an implementation for a
formal specification using broadcast protocols that are more efficient than ABCAST.
Let $\mathbb{S}$ be the class of all formal specifications and let $\mathbb{S}_{xbcast}$ be the subset of $\mathbb{S}$
containing all specifications that have an XBCAST implementation (where XBCAST
stands for ABCAST, CBCAST, ...). We have shown that $\mathbb{S}$ separates into two
distinct subclasses:

$\mathbb{S} = \mathbb{S}_{abcast} \supset \mathbb{S}_{cbcast} = \mathbb{S}_{fbcast} = \mathbb{S}_{bcast}$.

We call the second class $\mathbb{S}_{async}$ because specifications in this class have implementations
with the following characteristic. When a client invokes an operation it is
always possible to compute the return value immediately from local information.
This way the client never has to wait for replies from remote sites; information is
propagated asynchronously in the background.

We showed that a specification $\mathcal{S}$ is a member of the class $\mathbb{S}_{async}$ if and only
if there exists a constructive linearization operator for $\mathcal{S}$ that satisfies the condition
of Theorem 4.2. This linearization operator can be used to automatically construct
an implementation for $\mathcal{S}$. In the next chapter we will look at the problem of finding
such an operator.


Chapter  5 

Commutative  Specifications 


In the previous chapter we gave a complete characterization of the class of specifications
that have an asynchronous implementation. Unfortunately, as we will show in
the next section, this class is non-recursive, i.e., in general the question of whether
a specification has an asynchronous implementation is undecidable. This result
shows that we cannot find a general algorithm that would automatically construct
a suitable linearization operator from a given specification. Therefore, we have to
investigate methods that can be applied to certain “simple” subclasses of specifications.
In this chapter we explore the possibility of exploiting knowledge about
the commutativity of operations in a specification in order to construct linearization
operators.

5.1  Undecidability 

Theorem 4.2 reduces the problem of constructing an asynchronous implementation
for a specification $\mathcal{S}$ to the problem of finding a linearization operator LIN that
satisfies the condition of Theorem 4.2. Unfortunately this problem is still very hard.
In fact, the example below shows that the general problem of deciding whether a
specification $\mathcal{S}$ has an asynchronous implementation is undecidable.

Consider a system with two processors in which client 1 may invoke a parameterless
operation a; client 2 may invoke an operation b with one integer parameter.
Define the following class of specifications:

$\mathcal{S}_i = (2, I, V, S_i)$, where

$I = \{a_1\} \cup \{\, b_2(x) \mid x \in \mathbb{N} \,\}$

$V = \{0, 1\}$

$S_i = \{\, (a_1{:}0) \,\}$
$\quad \cup\; \{\, (b_2(x){:}0) \mid x \in \mathbb{N} \,\}$
$\quad \cup\; \{\, (a_1{:}0,\, b_2(x){:}0) \mid x \notin h_i \,\}$
$\quad \cup\; \{\, (a_1{:}0,\, b_2(x){:}1) \mid x \in h_i \,\}$
$\quad \cup\; \{\, (b_2(x){:}0,\, a_1{:}1) \mid x \in \mathbb{N} \,\}$

where

$h_i = \{\, x \mid x$ is an encoding of a computation of the
i'th Turing machine, $T_i$, in which $T_i$ halts $\}$

Lemma 5.1

$\mathcal{S}_i$ has an asynchronous implementation iff the Turing machine $T_i$ never halts.

Proof: Let LIN be a linearization operator that satisfies the condition of Theorem 4.2
for $\mathcal{S}_i$. Consider the run

$R = a_1{:}0 \parallel b_2(x){:}0$

for some $x \in \mathbb{N}$. If LIN is constructive then $LIN(\emptyset) = \emptyset \in S_i$ implies that there
exists a return value v such that $LIN(a_1{:}v) \in S_i$. The way $S_i$ is defined this is
only possible if v = 0. Hence $LIN(a_1{:}0) = (a_1{:}0) \in S_i$, and by the same argument
$LIN(b_2(x){:}0) = (b_2(x){:}0) \in S_i$. Therefore the run R satisfies local correctness. By
Theorem 4.2, $LIN(R) \in S_i$. This is only possible if $LIN(R) = (a_1{:}0, b_2(x){:}0)$ and
$x \notin h_i$. But R was locally correct for any x. Hence $h_i = \emptyset$, i.e., $T_i$ never halts.

Conversely, assume that $T_i$ never halts. Define a linearization operator LIN by
Table 5.1. The way $S_i$ is defined, a legal history has at most one event at $p_1$ and one
event at $p_2$. Consequently a locally correct run can have at most two events. Therefore
our table enumerates all possible locally correct runs. It is straightforward to check
that if $h_i = \emptyset$ then for every row in the table, either the history in the right column
is legal, or the run in the left column violates local correctness. □

Table 5.1: Linearization operator for $\mathcal{S}_i$

R                              LIN(R)                      Condition
$a_1{:}0$                      $(a_1{:}0)$
$a_1{:}1$                      $\bot$
$b_2(x){:}0$                   $(b_2(x){:}0)$              for all $x \in \mathbb{N}$
$b_2(x){:}1$                   $\bot$                      for all $x \in \mathbb{N}$
$a_1{:}0 \parallel b_2(x){:}0$ $(a_1{:}0,\, b_2(x){:}0)$   for all $x \in \mathbb{N}$
$a_1{:}u \parallel b_2(x){:}v$ $\bot$                      if $u \neq 0$ or $v \neq 0$
$a_1{:}0 \to b_2(x){:}0$       $(a_1{:}0,\, b_2(x){:}0)$   for all $x \in \mathbb{N}$
$a_1{:}u \to b_2(x){:}v$       $\bot$                      if $u \neq 0$ or $v \neq 0$
$b_2(x){:}0 \to a_1{:}1$       $(b_2(x){:}0,\, a_1{:}1)$   for all $x \in \mathbb{N}$
$b_2(x){:}u \to a_1{:}v$       $\bot$                      if $u \neq 0$ or $v \neq 1$
all other R                    $\bot$

The lemma shows that, if we had a procedure for deciding if $\mathcal{S}_i$ is simple for a given
i, then we would have solved the halting problem for Turing machines. But since
the halting problem is undecidable we have:

Corollary 5.1

The problem of finding those i for which $\mathcal{S}_i$ is simple is undecidable.

Fortunately, hardly any problem that arises in real distributed systems has anything
to do with Turing machine computations. In many cases the problem at hand can
be solved despite the undecidability of the general case.

5.2  Commutative  Specifications 

The difference between an ABCAST execution history and a CBCAST execution history
is that in the CBCAST case different processors may observe events in different
orders. Therefore, it should be easier to construct a CBCAST implementation if
certain events commute, that is, if their order in a legal history can be reversed
without making the history illegal. We explore this idea in this section.

5.2.1  Commutativity  and  Ordering  Constraints

We start by defining an equivalence relation on histories. Two histories are equivalent
if no sequence of future events can distinguish them.

Definition 5.1

Two histories, $H_1$ and $H_2$, are equivalent ($H_1 \equiv H_2$) iff
$\forall H$: $H_1 H \in S \;\Leftrightarrow\; H_2 H \in S$.

We  can  identify  the  equivalence  classes  of  histories  with  the  states  the  system  can 
be  in.  Different  histories  in  the  same  equivalence  class  represent  different  ways  of 
reaching  the  same  system  state.  We  use  this  equivalence  relation  to  distinguish 
between  read-only  events  and  update  events.  An  event  is  a  read-only  event  if  it  does 
not  change  the  system  state. 

Definition  5.2 

An event e is read-only iff

$\forall H$: $He \in S \;\Rightarrow\; H \equiv He$.

Events that are not read-only are called update events.

Note that whether a particular operation is read-only depends on the outcome (i.e.,
return value) of the operation. Consider, for instance, the PASS operation from
our token passing example. The event $P_i(x){:}ok$ is an update whereas $P_i(x){:}e_H$ is a
read-only event.

We now turn our attention to specifications in which update events always commute.

Definition 5.3

Specification $\mathcal{S}$ is commutative iff

$\forall H$: $\forall a, b$ update events at different processors:

$Ha \in S \wedge Hb \in S \;\Rightarrow\; Hab \in S \wedge Hba \in S \wedge Hab \equiv Hba$.


Clearly, read-only events always commute with each other, but read-only events
may or may not commute with update events. Consider $Ha, Hb \in S$, where a is
read-only and b is an update event. Then $Hab \in S$, otherwise a would not be
read-only. It could be that also $Hba \in S$; this would mean that the return value in
a is not affected by the update b. Otherwise, if $Hba \notin S$, then a is affected by b,
i.e., the return value in a is no longer valid if a is ordered after b. We denote this
kind of dependency between two events by the symbol “$\mapsto$”.

Definition 5.4

a is invalidated by b ($a \mapsto b$) iff

$\exists H$: $Ha \in S \wedge Hb \in S \wedge Hba \notin S$.

If $e_1 \mapsto e_2$ or $e_2 \mapsto e_1$ we also say that there is an ordering constraint between the
two events. From the above discussion it is clear that

Lemma 5.2

If $\mathcal{S}$ is commutative then

$a \mapsto b \;\Rightarrow\;$ a is read-only and b is an update event.
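Definition 5.4 can be made concrete with a brute-force check. The sketch below assumes a hypothetical encoding of the counter specification with READ and INC operations (an event is a pair such as `("inc", 3)` or `("read", 6)`, where the second component of a read is its return value), and it only searches a finite list of candidate histories supplied by the caller. It can therefore witness that a constraint holds, but it cannot prove that none exists.

```python
def legal(history):
    """A history is legal iff every read returns the sum of the preceding increments."""
    total = 0
    for op, arg in history:
        if op == "inc":
            total += arg
        elif op == "read" and arg != total:
            return False
    return True

def invalidated_by(a, b, histories):
    """True if some H in `histories` has Ha and Hb legal but Hba illegal."""
    return any(
        legal(H + [a]) and legal(H + [b]) and not legal(H + [b, a])
        for H in histories
    )

H = [("inc", 1), ("inc", 5)]
read6 = ("read", 6)
inc3 = ("inc", 3)

read_invalidated = invalidated_by(read6, inc3, [H])   # read ordered after inc3 is wrong
inc_invalidated = invalidated_by(inc3, read6, [H])    # inc3 after the read stays legal
```

In this toy encoding the read-only event is invalidated by the update but not the other way around, which matches the direction stated in Lemma 5.2.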

As an example let us consider the various events possible in our token
passing service. The read-only events are

$Q_i{:}T$, $Q_i{:}F$, $R_i{:}e_H$, $R_i{:}e_R$, $P_i(j){:}e_H$, $P_i(j){:}e_R$,

whereas the following are update events:

$R_i{:}ok$, $P_i(j){:}ok$.

Note that in the traditional sense, PASS and REQUEST operations do not commute.
For example, the history

$H = (\, R_3{:}ok,\; P_1(3){:}ok \,)$

would not be legal if we reversed the order of the two events. If the PASS operation
is invoked before the REQUEST operation it should return ErrorRequest instead
of OK. However, according to our definition the token passing specification is commutative.
This is because we require two update events a, b to commute only if
both events are legal independent of each other ($Ha \in S$ and $Hb \in S$ for some H).
Hence the fact that the two events $R_3{:}ok$ and $P_1(3){:}ok$ do not commute does not
affect the commutativity of the specification, because the second event ($P_1(3){:}ok$)
is never legal without the first. Formally:

$\neg \exists H \in S$: $H + R_3{:}ok \in S \;\wedge\; H + P_1(3){:}ok \in S$

A complete analysis of the token specification shows that any two update events
either commute or have the property that one is never legal without the other (see
Appendix A). Hence the token passing specification is commutative.

The intuitive reason for defining commutative specifications this way is the following.
If there are two updates a, b of this type (b is not legal without a) then
these two events will not occur concurrently in an execution. The two events will
always be related by information flow ($a \to b$). Because CBCAST preserves “$\to$”, all
processors will observe a before b. Therefore it does not matter whether the two
events commute.

In the token passing specification there are two types of ordering constraints:
the first one between certain QUERY and PASS events

$Q_i{:}F \mapsto P_j(i){:}ok$,

the second one between an unsuccessful PASS event and a REQUEST event:

$P_i(j){:}e_R \mapsto R_j{:}ok$

A complete table of dependencies for token passing events is given in the appendix.

5.2.2  Applying  Commutativity  to  Runs 

How do the concepts discussed in the previous section help us construct a linearization
operator for a commutative specification? Our plan is the following:

1. We assume that we can compute the ordering constraints “$\mapsto$” between any
pair of events. Given a run R, we construct what we call the closure of
R ($\overline{R}$) by adding extra edges to R: For all events $a, b \in R$ that are concurrent
in R, we add an edge $a \to b$ if the ordering constraint $a \mapsto b$ holds.

2. Provided that $\overline{R}$ has no cycles, we define LIN(R) by arbitrarily picking a
linearization H of $\overline{R}$. That is, we pick a history H that contains the same
events as R and has a total order that is consistent with “$\to$” and “$\mapsto$”.

Figure 5.1 shows an example of applying this method to a run of the token passing
service. It shows a run R and its closure. R is represented by the circles (events) and
solid arrows (information flow relation between those events). $\overline{R}$ is given by R plus
the dashed arrows (ordering constraints). We can get a legal history for this run by
ordering all its events in such a way that the partial order given by the solid and
dashed arrows is preserved. The history H given below the diagram shows one possible
linearization. We formalize this method below and show that the linearization
operator defined this way works in the case of commutative specifications.
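The two-step plan can be sketched as follows. This is an illustration only: the function names are hypothetical, the ordering constraints are passed in as an explicit set of pairs rather than computed from a specification, and the three-event run is a made-up example that merely reuses the constraint $Q_i{:}F \mapsto P_j(i){:}ok$ quoted above.

```python
from graphlib import TopologicalSorter, CycleError

def reachable(events, flow):
    """Transitive closure of the information-flow edges (Warshall's algorithm)."""
    reach = set(flow)
    for k in events:
        for a in events:
            for b in events:
                if (a, k) in reach and (k, b) in reach:
                    reach.add((a, b))
    return reach

def closure(events, flow, invalidated_by):
    """Step 1: add an edge a -> b for concurrent a, b whenever a is invalidated by b."""
    reach = reachable(events, flow)
    edges = set(flow)
    for a in events:
        for b in events:
            concurrent = a != b and (a, b) not in reach and (b, a) not in reach
            if concurrent and (a, b) in invalidated_by:
                edges.add((a, b))
    return edges

def linearize(events, edges):
    """Step 2: pick a linearization of the closure, or None if it has a cycle."""
    sorter = TopologicalSorter({e: set() for e in events})
    for a, b in edges:
        sorter.add(b, a)
    try:
        return list(sorter.static_order())
    except CycleError:
        return None

events = ["R3:ok", "Q3:F", "P1(3):ok"]
flow = {("R3:ok", "Q3:F"), ("R3:ok", "P1(3):ok")}   # solid arrows
constraints = {("Q3:F", "P1(3):ok")}                # ordering constraint
history = linearize(events, closure(events, flow, constraints))
```

Here the query and the pass are concurrent, so the closure gains one dashed edge and the only remaining linearization places the query before the pass.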


Figure  5.1:  An  example  run  and  one  of  its  linearizations 


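Definition 5.5

The closure $\overline{R}$ of a run R is the run R augmented by edges between any two
concurrent events a and b, whenever $a \mapsto b$, or formally:

$a \to b \in \overline{R} \;\Leftrightarrow\; (\, a \to b \in R \;\vee\; (a \parallel b \in R \wedge a \mapsto b) \,)$

Definition 5.6

$\mathcal{H}(R) = \{\, H \in S \mid H \text{ is a linearization of } \overline{R} \,\}$

$LIN_S(R) = \begin{cases} \text{“smallest” } H \text{ in } \mathcal{H}(R) & \text{if } \mathcal{H}(R) \neq \emptyset \\ \bot & \text{otherwise} \end{cases}$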

Recall from Theorem 4.1 that, to show that the CBCAST implementation will be
correct with this linearization operator, we have to prove

$LIN_S(R) \in S$ for every locally correct run R,

where local correctness means $LIN(R[a]) \in S$ for every $a \in R$. As defined above,
$LIN_S$ simply picks one possible linearization of $\overline{R}$ to map a run R to a history.
Hence in this case local correctness of R implies that every $R[a]$ (for a in R) has a
legal linearization. We call such a run weakly plausible:

Definition 5.7

R is weakly plausible $\Leftrightarrow$ $\forall a \in R$: $\exists$ legal linearization of $R[a]$.

If not just one, but all linearizations of $R[a]$ are legal, then we call this run strongly
plausible:

Definition 5.8

R is strongly plausible iff

$\forall a \in R$: $\exists$ legal linearization of $R[a]$
$\wedge$ every linearization of $R[a]$ is legal.

The relationship between local correctness under $LIN_S$ and strong and weak plausibility
is the following: Strong plausibility implies local correctness, and local correctness
implies weak plausibility. We will show (Lemma 5.4 below) that for commutative
specifications these two forms of plausibility are in fact equivalent. Hence
a run is locally correct if and only if it is plausible (strong or weak). Therefore
we only need to show that $LIN_S(R) \in S$ for strongly plausible runs R. The next
lemma will allow us to do this.

Lemma 5.3

If R is strongly plausible then every linearization of $\overline{R}$ is legal.

Proof: Induction on the number of events in R: Trivially satisfied for empty runs,
because empty histories are always legal. Now assume R is non-empty:

Case 1: If R has a unique maximal element a then $R = R[a]$, and our claim follows
from Definition 5.8.

Case 2: Let H be an arbitrary linearization of $\overline{R}$, let a be the last event in H,
and let $b \neq a$ be a maximal element of R. H can be written as $H = H'bH''a$. The
history $H_1 = H'bH''$ is a linearization of $R - a$. By induction hypothesis $H_1$ is legal.
Similarly, $H_2 = H'H''a$ is legal as a linearization of $R - b$. Let $H'''$ be equal to $H''$,
except that all read-only events are removed from $H'''$. If b is a read-only event
then

$H = H'bH''a \equiv H'H''a = H_2 \in S$

and we are done. Otherwise, b commutes with every event in $H'''$; hence

$H'H'''b \equiv H'bH''' \equiv H'bH'' = H_1 \in S$, and
$H'H'''a \equiv H'H''a = H_2 \in S$.

Then $H'H'''ba \in S$, because otherwise there would be an ordering constraint $a \mapsto b$,
but then a could not be the last event in a linearization of $\overline{R}$. Therefore

$H = H'bH''a \equiv H'bH'''a \equiv H'H'''ba \in S$.

□

We can now prove that for commutative specifications weak and strong plausibility
are equivalent.

Lemma 5.4

If $\mathcal{S}$ is commutative then

(i) Weak and strong plausibility are equivalent.

(ii) All linearizations of a plausible run are equivalent.

Proof: (i) We have to show that every weakly plausible run R is also strongly
plausible. Induction on the number of events in R: Trivially satisfied for empty
runs, because empty histories are always legal.

Now consider a non-empty, weakly plausible run R. Assume R is not strongly
plausible. Then there must be a left subrun $R[a]$ with a legal linearization, say $Ha$,
such that some other linearization $H'a$ is not legal. These two histories only differ
in the order of events that are concurrent in $R[a]$. We may transform one into the
other by swapping adjacent concurrent events. Thus we get a sequence of histories

$H_1 a,\; H_2 a,\; H_3 a,\; \ldots,\; H_n a$, where $H_1 = H$ and $H_n = H'$,

in which $H_i$ and $H_{i+1}$ differ only in the order of two adjacent events. If $H'a \notin S$
then the sequence must contain two consecutive histories,

$H_i = A b_1 b_2 B, \quad H_{i+1} = A b_2 b_1 B$

such that $H_i a$ is legal but $H_{i+1} a$ is not. Note that $H_i$ and $H_{i+1}$ are linearizations
of $R' = R[a] - a$. By induction hypothesis $R'$ is strongly plausible. By Lemma 5.3,
$H_i$ and $H_{i+1}$ are both legal. Because specifications are prefix-closed, $A b_1 b_2$ and
$A b_2 b_1$ must also be legal. If $\mathcal{S}$ is commutative then these last two histories are
equivalent, and hence $H_i a$ and $H_{i+1} a$ should either both be legal or both be illegal.
This contradicts our earlier assumption.

(ii) Let H and $H'$ be two linearizations of a plausible run R. We have to show that
$H' \equiv H$. H and $H'$ differ in the order of events that are concurrent in R. Again, we
transform one into the other by swapping adjacent concurrent events, leading to a
sequence of histories:

$H_1,\; H_2,\; H_3,\; \ldots,\; H_n$, where $H_1 = H$ and $H_n = H'$.

We can write $H_i$ and $H_{i+1}$ as

$H_i = A b_1 b_2 B, \quad H_{i+1} = A b_2 b_1 B$.

From part (i) we know that R is strongly plausible; hence, by Lemma 5.3, $H_i$ and
$H_{i+1}$ are both legal. Therefore their prefixes $A b_1$ and $A b_2$ are legal. If one of the
two events (say $b_1$) is a read-only event then

$A b_1 b_2 \equiv A b_2 \equiv A b_2 b_1$.

If both events are updates then they must commute. In any case, we have
$A b_1 b_2 \equiv A b_2 b_1$ and therefore $H_i \equiv H_{i+1}$. By transitivity $H \equiv H'$. □

This  lemma  now  allows  us  to  show  that  the  linearization  operator  we  introduced 
in  this  section  (Definition  5.6)  can  be  used  to  construct  asynchronous  implementa¬ 
tions. 

Definition  5.9 

A  commutative  specification  5  is  acyclic  iff 
V/2:  R  plausible  ^  ~R  acyclic. 

Otherwise  we  say  5  is  cyclic. 

Theorem  5.1 

If  5  is  commutative  and  acyclic  then  the  CBCAST  implementation  with  LI  Ns 
as  its  linearization  operator  is  correct. 

Proof: We will show that if every plausible run has an acyclic closure then $LIN_S$
is constructive and satisfies $LIN_S(R) \in S$ for every locally correct run $R$. The claim
then follows from Theorem 4.2.

(i) $LIN_S$ is constructive: Let $R \in S$, $a \in I$. We have to show that there is a return
value $v \in V$ such that $LIN_S(R + a{:}v) \in S$. $LIN_S$ is defined in such a way that the
order of events in $H = LIN_S(R + a{:}v)$ is independent of the choice for the return
value $v$, that is

\[ LIN_S(R + a{:}v) = LIN_S(R) + a{:}v, \quad \text{for all } v \text{ such that } LIN_S(R) + a{:}v \in S. \]

Because specifications are complete (Definition 3.3), $LIN_S(R) \in S$ implies that such
a value always exists.

(ii) $LIN_S(R) \in S$ for every locally correct $R$: Local correctness of $R$ implies that
$R$ is weakly plausible. By Lemma 5.4, $R$ is strongly plausible. By Lemma 5.3,
every linearization of $\overline{R}$ is legal. If $\overline{R}$ is acyclic then such a linearization exists, and
$H(R) \neq \emptyset$. Hence $LIN_S(R) \in S$. □
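Step (ii) relies on the fact that an acyclic relation can always be extended to a total order. The following Python sketch illustrates only that fact; it is not the operator $LIN_S$ of Definition 5.6. The function name `linearize` and the encoding of the closure as a list of (before, after) pairs, with causal and ordering-constraint edges merged, are assumptions made for this illustration.

```python
from graphlib import CycleError, TopologicalSorter


def linearize(events, edges):
    """Return one total order of `events` consistent with `edges`, or None.

    `edges` is an iterable of (before, after) pairs standing in for the
    closure of a run (hypothetical encoding, see the lead-in).
    """
    predecessors = {e: set() for e in events}
    for before, after in edges:
        predecessors[after].add(before)
    try:
        return list(TopologicalSorter(predecessors).static_order())
    except CycleError:
        return None  # cyclic closure: no linearization exists


# An acyclic closure has a linearization; a cyclic one does not.
acyclic_order = linearize(["a", "b", "c"], [("a", "b"), ("a", "c")])
cyclic_order = linearize(["a", "b"], [("a", "b"), ("b", "a")])
```

A topological sort returns some linearization; whether that history is legal is a separate question, which the proof settles with Lemma 5.3.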

5.2.3  Proving  Acyclicity 

In Chapter 4 we gave an example of a specification for a simple counter that does
not have a CBCAST implementation. This specification is commutative: READ
operations are read-only and INC operations commute. However, the acyclicity
requirement in Theorem 5.1 is not satisfied, as the example in Figure 5.2 shows. The run
in this figure is plausible; for example

\[ R[Read_1{:}6] = Inc_2(1) \to Inc_1(5) \to Read_1{:}6 \]

has only one linearization, and this linearization is legal. However, the closure of $R$
has a cycle

\[ Inc_1(5) \to Read_1{:}6 \mapsto Inc_2(3) \to Read_2{:}4 \mapsto Inc_1(5). \]

As we have already seen in Section 4.2.1, this problem has no asynchronous
implementation.

Figure 5.2: An example run

In  this  section  we  will  present  techniques  for  deciding  whether  a  specification 
is  cyclic  or  not.  We  will  illustrate  our  techniques  by  applying  them  to  our  token 
passing  example. 

Definition  5.10 

Let $R$ be a run with a cycle $C$ in $\overline{R}$:

\[
\begin{array}{rl}
C = & e_{1,1} \to e_{1,2} \to \cdots \to e_{1,n_1} \\
    & \mapsto e_{2,1} \to e_{2,2} \to \cdots \to e_{2,n_2} \\
    & \quad \vdots \\
    & \mapsto e_{m,1} \to e_{m,2} \to \cdots \to e_{m,n_m} \\
    & \mapsto e_{1,1}
\end{array}
\]

We call

\[ e_{i,1} \to e_{i,2} \to \cdots \to e_{i,n_i} \]

(for $i = 1 \ldots m$) a segment of the cycle.

Lemma 5.5

Every cycle in the closure $\overline{R}$ of a run $R$ has at least two segments.

Proof: Because $R$ itself is acyclic, every cycle in $\overline{R}$ must contain at least one
$\mapsto$ edge. Consider a cycle with only one segment:

\[ C = e_1 \to e_2 \to \cdots \to e_n \mapsto e_1 \]

According to Definition 5.5, $\overline{R}$ contains $\mapsto$ edges only between events that are
concurrent in $R$. Since $e_1 \to e_n$ (by transitivity), $\overline{R}$ cannot contain the edge $e_n \mapsto e_1$.
□

Lemma 5.6

If $\overline{R}$ has a cycle then it also has a cycle in which

(i) all segments are concurrent, i.e., $a \parallel b$ for any two events $a$ and
$b$ in different segments.

(ii) every segment has at most two events.

Proof: (i) Let

\[
\begin{array}{rl}
C = & e_{1,1} \to e_{1,2} \to \cdots \to e_{1,n_1} \\
    & \mapsto e_{2,1} \to e_{2,2} \to \cdots \to e_{2,n_2} \\
    & \quad \vdots \\
    & \mapsto e_{m,1} \to e_{m,2} \to \cdots \to e_{m,n_m} \\
    & \mapsto e_{1,1}
\end{array}
\]

be a cycle in $\overline{R}$. Assume $C$ has two non-concurrent segments. Then there are two
events $a = e_{i,j}$ and $b = e_{k,l}$ such that $a \to b$ in $R$. We can use this relation to
construct a smaller cycle

\[
\begin{array}{rl}
C' = & a \to b \to e_{k,l+1} \to \cdots \to e_{k,n_k} \\
     & \mapsto e_{k+1,1} \to e_{k+1,2} \to \cdots \\
     & \quad \vdots \\
     & \mapsto e_{i,1} \to \cdots \to a
\end{array}
\]

The cycle $C'$ has strictly fewer segments than $C$, because $C'$ does not contain

\[ a \to e_{i,j+1} \to \cdots \to e_{i,n_i} \mapsto e_{i+1,1} \to \cdots \mapsto e_{k,1} \to \cdots \to e_{k,l-1} \to b \]

which has been replaced by the “short cut” $a \to b$. We repeat this process until the
resulting cycle no longer has non-concurrent segments. Lemma 5.5 ensures that the
process need only be repeated a finite number of times.

(ii) Consider a segment

\[ e_{i,1} \to e_{i,2} \to \cdots \to e_{i,n_i} \]

that has more than two events. Because of the transitivity of $\to$, this segment can
be replaced by the shorter segment

\[ e_{i,1} \to e_{i,n_i}. \]

□

This lemma expresses the following intuitive idea: The CBCAST implementation
breaks down if two different processors take mutually inconsistent actions without
knowing about the other's action. The inconsistency of these actions is expressed as
a cycle in a run. The fact that the two processors do not know about each other’s
actions is expressed by the corresponding events being concurrent.

Specifications  that  are  acyclic  have  the  property  that  certain  types  of  events 
which  are  part  of  ordering  constraints  can  never  occur  concurrently  in  a  plausible 
run.  We  call  such  events  mutually  exclusive. 

Definition  5.11 

Two events $a$ and $b$ are mutually exclusive under $S$ if

\[ \forall R:\ R \text{ plausible} \Rightarrow \neg(a \parallel b). \]

We  prove  a  specification  to  be  acyclic  by  showing  that  any  cycle  in  the  closure  of  a 
plausible  run  would  contain  mutually  exclusive  events  in  different  segments  of  the 
cycle.  This  would  force  all  cycles  to  have  non-concurrent  segments.  However,  this 
contradicts  Lemma  5.6,  which  we  just  proved. 

Let  us  return  to  our  token  passing  example.  We  will  now  prove  that  every 
plausible  run  of  the  token  passing  specification  has  an  acyclic  closure. 

Theorem  5.2 

Consider the token passing example. Two successful PASS events of the form

\[ a = P_i(x){:}ok \quad \text{and} \quad b = P_j(y){:}ok, \quad \text{for } i \neq j \]

are mutually exclusive.

This claim is very intuitive. If two such events were not mutually exclusive there
could be two processors holding the token at the same time, violating the token
passing specification.

Proof: Consider a run $R$ with two concurrent pass events $a = P_i(x){:}ok$ and
$b = P_j(y){:}ok$. We show that $R$ cannot be plausible by induction on the number of
events in $R$.

Base case: $R$ contains no events other than $a$ and $b$. Assume $R$ is plausible. Then
$R[a] = a$ and $R[b] = b$ must have legal linearizations $H_a = (a)$ and $H_b = (b)$
respectively. Because processor 1 is the initial token holder, $H_a = (P_i(x){:}ok)$ can
only be legal if $i = 1$. For the same reason $H_b$ is only legal if $j = 1$. But $i = j$
contradicts the assumption that $a$ and $b$ are concurrent.

For the induction step consider $R$ with more than two events. Assume $R$ is plausible.
Then $R[a]$ and $R[b]$ have legal linearizations $H_a$ and $H_b$ respectively. Let
$R' = R[a] \cap R[b]$. By induction hypothesis $R[a]$, $R[b]$, as well as $R'$ do not have concurrent
pass events. Therefore we can define the following events:

$c = P_k(z){:}ok$   The last pass event in $R'$.

$a' = P_i(x){:}ok$   The first pass event after $c$ in $R[a]$.

$b' = P_j(y){:}ok$   The first pass event after $c$ in $R[b]$.

(where possibly, but not necessarily, $a' = a$ and/or $b' = b$.) Note that $a' \parallel b'$ because
otherwise either $a'$ or $b'$ would be in $R'$. Then the histories $H_a$ and $H_b$ have the
form

\[ H_a = \ldots\ P_k(z){:}ok\ \ldots\ P_i(x){:}ok\ \ldots\ a \]

\[ H_b = \ldots\ P_k(z){:}ok\ \ldots\ P_j(y){:}ok\ \ldots\ b \]

with no pass events between $c$ and $a'$ in $H_a$ and between $c$ and $b'$ in $H_b$. Then $H_a$
can only be legal if $i = z$, otherwise the operation $P_i(x)$ should return an error code
$eH$. For the same reason $H_b$ is only legal if $j = z$. Hence $i = j$, i.e., $a'$ and $b'$ are
events at the same processor. But that contradicts $a' \parallel b'$. □

Not  only  pass  operations  but  any  two  events  that  indicate  that  the  caller  is  the 
current  token  holder  are  mutually  exclusive: 

Theorem  5.3 

Any  two  events  of  the  following  types  are  mutually  exclusive: 

\[ Q_i{:}T, \quad R_i{:}eH, \quad P_i(x){:}ok, \quad \text{or} \quad P_i(x){:}eR \]

The  proof  is  very  similar  to  the  one  for  Theorem  5.2;  it  is  carried  out  in  Appendix  A. 

Now let us consider the ordering constraints occurring in the token passing
specification. These constraints are of one of the following three types (see Appendix A):

(I) $Q_i{:}F \mapsto P_j(i){:}ok$

(II) $R_i{:}eR \mapsto P_j(i){:}ok$

(III) $P_i(j){:}eR \mapsto R_j{:}ok$

Theorem  5.4 

The  token  passing  specification  is  acyclic. 

Proof: Assume not. Let $R$ be a plausible run whose closure contains a cycle. By Lemma 5.6
we may assume that the cycle only has concurrent segments. Consider the ordering
constraint edges (“$\mapsto$”) in such a cycle. The cycle cannot contain more than one
edge of type (III), otherwise there would be two pass events in different segments
of the cycle, which is not possible since segments are concurrent and pass events
are mutually exclusive. For the same reason there cannot be more than one edge of
type (I) or (II) in the cycle. By Lemma 5.5 the cycle has at least two $\mapsto$ edges:
hence it must have exactly one edge of type (III) and one of type (I) or (II). Hence
the cycle is of the following form:

\[
\begin{array}{rl}
C = & P_j(i){:}ok \to P_k(l){:}eR \\
    & \mapsto R_l{:}ok \to e \\
    & \mapsto P_j(i){:}ok
\end{array}
\]

where either $e = Q_i{:}F$ or $e = R_i{:}eR$.

The first segment of the cycle consists of the two pass events $a = P_j(i){:}ok$ and
$b = P_k(l){:}eR$. If $R$ is plausible then $R[b]$ has a legal linearization $H_b$. Because
$a \to b$, $a$ is in $R[b]$ and therefore also in $H_b$. Hence $H_b$ has the form

\[ H_b = \ldots\ P_j(i){:}ok\ \ldots\ P_k(l){:}eR \]

Notice that the return value $eR$ of the last event (the pass operation failed because
processor $l$ did not request the token) indicates that processor $k$ is holding the token
at that time. Therefore $H_b$ must contain a pass event $c = P_i(x){:}ok$ between the two
events $a$ and $b$ in $H_b$; otherwise processor $i$ would still be holding the token at the
end of $H_b$. From Theorem 5.3 we know that $c$ cannot be concurrent with $a$ or $b$;
hence

\[ a \to c \to b. \]

Now consider the event $e$ in the second segment of the cycle. Events $c$ and $e$ cannot
be concurrent, because the operations were both invoked at processor $i$. If $c \to e$
we have $a \to c \to e$; hence $a \to e$. If $e \to c$ we have $e \to c \to b$; hence $e \to b$. In
both cases the two segments of the cycle would not be concurrent, contradicting
Lemma 5.6. □

Let  us  summarize  our  techniques  for  deciding  whether  a  specification  is  acyclic. 
Lemma  5.6  allows  us  to  restrict  our  search  for  cycles  to  certain  simple  types  of  cycles 
with  the  following  three  properties: 

1.  The  cycle  has  at  least  two  segments,  i.e.,  it  contains  at  least  two  edges. 

2.  Every segment has at most two events.

3.  All  segments  are  mutually  concurrent. 

Because  of  properties  1  and  3,  such  a  cycle  must  have  concurrent  events  that  occur 
in  an  ordering  constraint.  Therefore  we  are  successful  if  we  can  show  that  events 
that  are  involved  in  ordering  constraints  are  mutually  exclusive,  i.e.,  do  not  occur 
concurrently  in  a  plausible  run. 

5.3  Mixed  Implementations 

The techniques we outlined in the previous sections are still useful if a specification is
cyclic or even if it is not strictly commutative. In the case where these techniques fail
to produce a correct asynchronous implementation for a specification $S$, it is often
not necessary to resort to an implementation that is based on atomic broadcasts
only. Instead, it is often possible to construct a mixed implementation in which most
events are propagated with CBCAST and ABCAST is used only for certain “critical”
events. For example, consider a service for managing shared data. If clients are
required to explicitly acquire locks before modifying the data, then only LOCK and
UNLOCK operations need to be globally ordered. Once a lock is granted the actual
updates may be propagated asynchronously [JB86]. The techniques developed in
the previous sections of this chapter allow us to identify which events are “critical”:
events that do not commute and events that occur in cycles.

In this section we will outline how the results from this and the preceding chapter
can be generalized to apply to such mixed implementations. We modify our
definition of implementation by adding a parameter $A$ defining the set of “critical”
operations that must be propagated by an atomic broadcast. That is, an
implementation is now a 9-tuple:

\[ (n, I, V, M, Q, q_0, \ldots, A), \quad \text{where } A \subseteq I. \]
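The role of the parameter $A$ can be illustrated with a small Python sketch for the lock-based data service mentioned above. The operation names, the function `choose_broadcast`, and the string labels for the two primitives are assumptions made for this illustration; they are not part of the formal definition.

```python
def choose_broadcast(operation, critical):
    """Pick the primitive used to propagate `operation`.

    `critical` plays the role of the set A: operations in it are sent by
    atomic broadcast, all others by causal broadcast.
    """
    return "ABCAST" if operation in critical else "CBCAST"


# Hypothetical lock-based service: only LOCK and UNLOCK are critical.
CRITICAL = frozenset({"LOCK", "UNLOCK"})
plan = {op: choose_broadcast(op, CRITICAL) for op in ("LOCK", "UPDATE", "UNLOCK")}
```

With an empty critical set every operation is sent by CBCAST, which corresponds to the purely asynchronous implementations of the previous sections.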

We  also  need  a  new  ordering  axiom  that  defines  mixed  implementation  histories: 

Definition  5.12 

Ordering  axiom  for  mixed  implementations: 

(i) Causal ordering:

\[ inv_E(i,j) \to inv_E(l,m) \;\Rightarrow\; \forall k:\ rcv_E((i,j),k) <_k rcv_E((l,m),k) \]

(iia) $\forall\, inv_E(i,j) \in A$: (Global ordering)

\[ \forall\, i', j', k, l:\quad rcv_E((i,j),k) <_k rcv_E((i',j'),k) \;\Rightarrow\; rcv_E((i,j),l) <_l rcv_E((i',j'),l) \]

(iib) $\forall\, inv_E(i,j) \in I - A$: (Immediate local delivery)

\[ \neg \exists\, a:\ inv_E(i,j) <_i a <_i rcv_E((i,j),i) \]

The axiom requires that all message delivery must be consistent with “$\to$” (i),
that all message delivery must be globally ordered with respect to messages sent by
atomic broadcast (iia), and that messages sent by CBCAST are immediately delivered
locally (iib). Notice that if $A = \emptyset$ (all events propagated by CBCAST) this definition is
equivalent to the CBCAST ordering axiom (Definition 4.5).

A mixed implementation is constructed the same way as a CBCAST implementation,
based on a linearization function. However, the correctness condition can
be relaxed, because certain types of runs cannot occur in an execution of a mixed
implementation. We formalize this below:

Definition  5.13 

A  run  R  is  called  permissible  under  A  if  events  with  invocations  in  A  are 
globally  ordered  with  respect  to  all  other  events: 

\[ \forall e \in A \times V:\ \forall e' \in R:\ \neg(e \parallel e') \]
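Definition 5.13 can be checked mechanically on a small run. The Python sketch below assumes a run is given as a list of events plus a list of (earlier, later) edges, and that the critical events themselves (rather than their invocations) are passed in; these encodings and the function names are illustrative assumptions.

```python
from itertools import product


def happens_before(events, edges):
    """Transitive closure of `edges` as a set of (earlier, later) pairs."""
    reach = set(edges)
    # product() varies its first component slowest, so k is the outer
    # (intermediate-node) loop, as Warshall's algorithm requires.
    for k, i, j in product(events, repeat=3):
        if (i, k) in reach and (k, j) in reach:
            reach.add((i, j))
    return reach


def concurrent(a, b, reach):
    """Two distinct events are concurrent if neither precedes the other."""
    return a != b and (a, b) not in reach and (b, a) not in reach


def is_permissible(events, edges, critical):
    """True if no event in `critical` is concurrent with any other event."""
    reach = happens_before(events, edges)
    return not any(concurrent(e, other, reach)
                   for e in critical for other in events)


# Hypothetical run: two updates between a lock and an unlock.
events = ["lock", "u1", "u2", "unlock"]
edges = [("lock", "u1"), ("lock", "u2"), ("u1", "unlock"), ("u2", "unlock")]
ok_run = is_permissible(events, edges, {"lock", "unlock"})
bad_run = is_permissible(events, edges, {"u1"})
```

In this example the two updates are concurrent with each other, which is allowed as long as neither of them is critical.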

Lemma  5.7 

Let $Y$ be a mixed implementation and let $E$ be a mixed execution history.
Then $R_Y(E)$ is permissible.

Proof:  Follows  immediately  from  Definition  5.12  (iia).  □ 

Because of this property of mixed implementations, the correctness condition that
we established for CBCAST implementations (Theorem 4.1) needs to be satisfied only
for permissible runs:

Theorem  5.5 

If $LIN$ is constructive and satisfies

\[ \forall \text{ permissible runs } R:\ R \text{ locally correct} \Rightarrow LIN(R) \in S, \]

then $Y_{S,LIN}$ is a correct mixed implementation of specification $S$.

Proof: We show that for every mixed execution history $E$, the history
$H = LIN(R_Y(E))$ is legal and satisfies the correctness and liveness conditions of
Definition 3.12.

Let $E$ be a mixed execution history, and let $R = R_Y(E)$. By Lemma 4.5, $R$ is
locally correct. By Lemma 5.7, $R$ is also permissible. Therefore, by assumption,
the history $H = LIN(R)$ is legal. The rest of the proof is exactly the same as the
proof of Theorem 4.1 on page 64. □

This  theorem  also  allows  us  to  generalize  the  results  of  Section  5.2.2. 

Corollary  5.2 

If a commutative specification $S$ satisfies the following condition

\[ \forall \text{ permissible } R:\ R \text{ plausible} \Rightarrow \overline{R} \text{ acyclic}, \]

then the mixed implementation under $LIN_S$ (Definition 5.6) is correct.

In  the  previous  section  we  demonstrated  how  to  prove  that  a  specification  is 
acyclic  by  showing  that  events  that  could  form  a  cycle  do  not  occur  concurrently 
in  a  plausible  run  (i.e.,  are  mutually  exclusive).  Events  that  could  form  a  cycle  but 
are  not  mutually  exclusive  cause  this  technique  to  fail.  However,  by  Corollary  5.2, 
a  mixed  implementation  will  be  correct  if  such  events  are  propagated  by  atomic 
broadcast,  ensuring  that  they  do  not  occur  concurrently  in  a  permissible  run. 

In a similar way it is possible to extend our results to mixed implementations
in which events that are not commutative are propagated by atomic broadcast.

5.4  Summary

In this chapter we saw that in general, the question of whether a linearization
operator exists for a given specification is undecidable. Hence there are no general
methods for finding such operators. Therefore, we considered a restricted class
of specifications which we call commutative. We showed how to exploit knowledge
about the commutativity of events to construct linearization operators for
specifications in this class. These methods are useful for developing efficient asynchronous
implementations for a broad range of practical problems.


Chapter  6 


Failures 


In  our  treatment  so  far  we  have  assumed  a  distributed  system  that  is  perfectly 
reliable.  However,  one  of  the  main  uses  of  broadcast  protocols  is  in  the  design  of 
fault-tolerant programs. In this chapter we will address the problems that arise if
we take processor failures into account.

6.1  Integrating  Failures  into  the  Model 

In Chapters 3 and 4 we showed how to take a formal specification of a centralized
service and use broadcast protocols to construct a distributed implementation of
this service. Now we want to make the distributed service fault tolerant. What we
mean by “fault tolerant” is that even if some processors fail, the behavior of the
distributed service should be indistinguishable from a perfectly reliable centralized
server. As we will see in this section, we achieve this goal simply by replacing the
broadcast protocols used in the implementations constructed in Chapters 3 and 4
by reliable versions of the same protocols. In other words, if the broadcast protocol
used in an implementation provides atomic message delivery, the implementation
will automatically be fault tolerant.

To be more precise, we must add failures to our execution model. An execution
history may now contain failure events in addition to invocation and receive
events. We modify our definition of execution histories (Definitions 3.4 and 3.7)
accordingly:

Definition  6.1 

An unreliable execution history $E = (E_1, \ldots, E_n)$ is a collection of ordered sets
of invocation, receive, and failure events,

\[ E \in \left[ (I \cup \mathbb{N}^2 \cup \{\mathrm{FAIL}\})^* \right]^n, \]

satisfying the following conditions:

(i) Reliable message delivery:

\[ \forall\, inv_E(i,j):\ \forall k:\ \exists \text{ unique receive event } (i,j) \in E_k \]

(ii) Sequential invocation:

\[ \forall i,j:\ rcv_E((i,j),i) <_i inv_E(i,j+1) \]

(iii) Monotonicity of time:

\[ \neg \exists\, e_1, \ldots, e_m \in E:\ e_1 \to e_2 \to \cdots \to e_m \to e_1. \]

(iv) No invocation events after a failure:

\[ \forall k:\ \mathrm{FAIL} \in E_k \Rightarrow \neg \exists \text{ invocation event } a \in E_k:\ a >_k \mathrm{FAIL} \]

Conditions (i)-(iii) are exactly the same as in Definitions 3.4 and 3.7. For notational
convenience we pretend that a processor that has crashed still receives broadcasts.
Hence a failure is simply an event after which a processor stops sending any new
messages (condition (iv)). Figure 6.1 illustrates such an execution history. Notice
that this model describes an implementation based on reliable broadcast protocols,
because condition (i) in Definition 6.1 ensures atomic message delivery.

Figure 6.1: An execution history with failure events
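Condition (iv) is the only new requirement compared with a reliable execution history, and it is easy to check for a single processor. The Python sketch below assumes a hypothetical encoding in which a local event sequence is a list of `("inv", op)` and `("rcv", msg)` tuples plus a `FAIL` marker; neither the encoding nor the function name comes from the formal model.

```python
FAIL = "FAIL"  # marker for a failure event (assumed encoding)


def no_invocation_after_failure(local_history):
    """Check condition (iv) for one processor's ordered event sequence."""
    failed = False
    for event in local_history:
        if event == FAIL:
            failed = True
        elif failed and event[0] == "inv":
            return False  # an invocation follows a failure
    return True


# A crashed processor may still "receive" broadcasts, but not invoke.
valid = no_invocation_after_failure([("inv", "a"), FAIL, ("rcv", (1, 0))])
invalid = no_invocation_after_failure([FAIL, ("inv", "a")])
```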

The definition of an implementation as an 8-tuple $(n, I, V, M, Q, q_0, \ldots)$ remains
unchanged, but we have to specify the effect of failure events. We do so by defining
the state of a processor after a failure to be undefined ($\bot$); that is, we modify the
definition of $stat_E[i,j]$ as follows:


Definition  6.2 

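\[
stat_E[i,j] =
\begin{cases}
q_0 & \text{if } j = 0 \\
\bot & \text{if } E[i,j] = \mathrm{FAIL} \\
\bot & \text{if } stat_E[i,j-1] = \bot \\
\phi_I(stat_E[i,j-1], a) & \text{if } E[i,j] = a \text{ is an invocation event} \\
\phi_R(stat_E[i,j-1], m) & \text{if } E[i,j] = (k,l) \text{ is a receive event, where} \\
 & m = \phi_T(stat_E[k, inum_E(k, inv_E(k,l))])
\end{cases}
\]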
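The effect of Definition 6.2 is that the undefined state is absorbing: once a processor has failed, every later state is undefined as well. The Python sketch below illustrates this under simplifying assumptions: the transition functions are passed in as the hypothetical parameters `on_invoke` and `on_receive`, receive events carry their message directly, and `None` stands for the undefined state (so `q0` must not be `None`).

```python
FAIL = "FAIL"   # marker for a failure event (assumed encoding)
BOTTOM = None   # stands for the undefined state


def state_after(local_history, q0, on_invoke, on_receive):
    """Fold one processor's event sequence into its final state."""
    state = q0
    for event in local_history:
        if state is BOTTOM or event == FAIL:
            state = BOTTOM  # a failure, or any event after a failure
        elif event[0] == "inv":
            state = on_invoke(state, event[1])
        else:
            state = on_receive(state, event[1])
    return state


# Hypothetical counter replica: invocations leave the state unchanged,
# received messages carry the amount to add.
def keep(state, _operation):
    return state


def add(state, amount):
    return state + amount


alive = state_after([("inv", "INC"), ("rcv", 5), ("rcv", 3)], 0, keep, add)
crashed = state_after([("rcv", 5), FAIL, ("rcv", 3)], 0, keep, add)
```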


The definitions of $msg_E(i,j)$, $val_E(i,j)$, $event_E(i,j)$, and $H[E,i]$ remain the same as
before (Definition 3.11). In an unreliable system we define an implementation to be
correct if all operational sites cannot distinguish its behavior from that of a perfectly
reliable centralized service:

Definition  6.3 

$Y$ is a correct XBCAST-implementation of specification $S = (n, I, V, S)$ iff:

\[ \forall \text{ XBCAST execution history } E:\ \exists\, H \in S: \]

Correctness: $\forall i$: if $\mathrm{FAIL} \notin E_i$ then $H|_i = H[E,i]$

Liveness: $\forall i, j, k, l$:

\[ rcv_E((i,j),k) <_k inv_E(k,l) \;\Rightarrow\; event_E(i,j) <_H event_E(k,l) \]


The fact that in our execution model message delivery is reliable ensures that
processors that do not fail are not affected by the failure of other processors. This
is expressed in the following lemma:

Lemma  6.1 

Let $E$ be an execution history with failure events, and let $E'$ be identical to $E$
except that all failure events are deleted from it. Then

(i) $E'$ is a well-formed execution history.

(ii) $\forall i$: if $E_i$ does not contain a failure event then $H[E,i] = H[E',i]$.

Proof: (i) As an unreliable execution history, $E$ satisfies conditions (i)-(iii) in
Definition 6.1. Condition (i) ensures that $E'$ is an execution sequence according
to Definition 3.4; conditions (ii) and (iii) ensure that this execution sequence is a
well-formed execution history (Definition 3.7).

(ii) By construction, all invocations in $H[E,i]$ and $H[E',i]$ are identical. Hence we
only have to show that all return values are also the same. Consider the formal
event $e = a{:}v = event_E(i,j) \in H[E,i]$. Then $a = inv_E(i,j)$ and $v = val_E(i,j)$.
Let $b = E[i, rnum_E((i,j),i) - 1]$ be the corresponding receive event. Recall that
according to Lemma 4.1 $val_E(i,j)$ only depends on events that precede $b$ under $\to$.
Hence we are done if we can show that $E[b] = E'[b]$. Assume that $E[b] \neq E'[b]$.
This is only possible if $E[b]$ contains a failure event $f$. Because a processor does
not send any messages after it fails, the only events related to a failure event $f$ are
receive events after $f$ at the same processor. Hence $f \in E[b]$ implies that $f \in E_i$,
contradicting our assumption that $E_i$ does not contain a failure event. □

Theorem 6.1

Every correct implementation of a specification $S$ is also fault-tolerant.

Proof: Let $Y$ be a correct implementation of $S$ and let $E$ be an unreliable execution
history. Let $E'$ be $E$ with failure events deleted. Because $Y$ is correct, there exists a
history $H \in S$ that satisfies the correctness and liveness conditions with respect to
$E'$. By Lemma 6.1, $H[E,i] = H[E',i]$ for all $E_i$ with no failure events. Therefore $H$
will also satisfy the correctness and liveness conditions with respect to $E$ as stated
in Definition 6.3. □
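The construction of $E'$ used in Lemma 6.1 and Theorem 6.1 amounts to filtering failure events out of every local sequence. A minimal Python sketch, assuming the same hypothetical encoding as a list of per-processor event lists with a `FAIL` marker:

```python
FAIL = "FAIL"  # marker for a failure event (assumed encoding)


def delete_failures(history):
    """Return E': the same per-processor sequences with FAIL removed."""
    return [[e for e in local if e != FAIL] for local in history]


def unaffected(history):
    """Indices of processors whose sequence contains no failure event."""
    return [i for i, local in enumerate(history) if FAIL not in local]


# Processor 0 never fails; processor 1 fails between two receives.
E = [
    [("inv", "a"), ("rcv", (0, 0))],
    [("rcv", (0, 0)), FAIL, ("rcv", (0, 1))],
]
E_prime = delete_failures(E)
survivors = unaffected(E)
```

For every processor listed in `survivors`, the local sequence is the same in `E` and `E_prime`. The lemma makes the stronger claim that the return values observed there are also unchanged, which this sketch does not check.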

6.2  Client  Failures 

A  processor  failure  not  only  affects  a  component  of  a  distributed  service,  but  also 
the  client  running  at  that  site.  The  designer  of  a  distributed  service  may  want  to 
explicitly  specify  a  particular  action  to  be  taken  if  a  client  fails.  In  the  token  passing 
service,  for  example,  it  is  desirable  that  the  token  is  not  lost  if  its  current  holder 
fails.  Hence,  we  would  want  to  specify  the  behavior  of  the  token  passing  service  in 
such  a  way  that  the  token  is  automatically  transferred  to  some  other  client  in  the 
case  of  a  failure. 

This problem can be solved within our formalism by treating a client failure like
any other operation invoked by a client. In other words, the specification is designed
as if a client invoked a special operation “CRASH” just before its processor fails. If
the distributed system provides a means of detecting failures, such a specification
can be implemented in the same way as specifications that do not contain client
failures. For example, the ISIS system provides a failure detection and notification
mechanism that makes the failure of a processor look as if the processor sent out a
broadcast announcing its death just before the failure [BJ87b,BJSS86].

6.3  Summary 

In this chapter we showed that reliable broadcast protocols can be used to construct
a fault-tolerant distributed service. This approach is very similar to the method of
replicated state machines described by Schneider in [Sch86].


Chapter  7 
Conclusion 

7.1  Summary  and  Discussion 

We considered a variety of reliable broadcast protocols that differ in the form of
message ordering they provide: atomic broadcast (ABCAST), causal broadcast
(CBCAST), FIFO broadcast (FBCAST), and unordered broadcast (BCAST). The stronger
the ordering property of the protocol, the more costly its implementation. There is a
fundamental difference between atomic broadcast and the other forms of broadcast.
An atomic broadcast protocol requires at least two phases of message exchange,
whereas CBCAST, FBCAST, and BCAST can be implemented as one-phase protocols.
Furthermore, in an unreliable system in which processors may experience failures,
ABCAST can only be implemented if failures are detectable or if an upper bound on
message delays is known. CBCAST, FBCAST, and BCAST, on the other hand, can be
implemented reliably in a completely asynchronous system.

Our results from Chapter 4 show that this fundamental difference is also reflected
in the classes of problems that can be solved with a particular broadcast protocol.
We showed that the class of all formal specifications separates into two distinct
subclasses, $S_{async}$ and $S - S_{async}$, which correspond to specifications that have
an implementation based on CBCAST, FBCAST, or BCAST, and specifications that
require the global ordering that ABCAST provides.

For specifications in $S_{async}$, an implementation can be expressed in a canonical
form based on a linearization function for that specification. Although the
existence of such a function in general is undecidable, it is possible to analyze
commutativity and dependencies between events to find linearization functions for a
subclass of $S_{async}$. The methodology introduced in Chapter 5 allows identification
of conflicting events and establishes conditions that allow the construction of an
asynchronous implementation. Specifications for which this method is successful
could be characterized as “self-synchronizing”; that is, the specification itself
prevents certain conflicting events from occurring concurrently. Even if our techniques
fail to yield a completely asynchronous implementation, they are still useful for
constructing a mixed implementation, as they identify a subset of events that need to
be propagated by atomic broadcast.

7.2  Future  work 

It should be possible to extend the results from Chapter 5. Notice that the
conditions presented in that chapter are sufficient but not necessary for the existence of
an asynchronous implementation. This naturally raises the question of whether the
methodology can be generalized to cover a larger set of specifications. For example,
there are non-commutative specifications that have asynchronous implementations.
Examples are specifications in which only operations invoked by one particular
processor are sensitive to the order of the events. For example, one can modify the
token passing specification to require token requests to be serviced in FIFO order.
Such a specification is not commutative, because token requests no longer commute.
However, because only the current token holder decides which processor receives the
token next, it is not necessary that token requests are globally ordered.

Another interesting problem is to generalize our formalism to allow
implementations that exhibit “temporary inconsistencies”. In Chapter 4 we showed that
the problem of implementing a shared counter does not have an asynchronous
solution. However, for certain types of implementations it might be acceptable if a
read operation returns the sum of only a subset of previous increments, as long as
every increment is eventually reflected in all future reads. The formalism presented
in Chapter 3 allows us to relax the shared counter specification to allow reads to
return partial sums. However, because we need specifications to be prefix-closed, we
cannot express the requirement that increments are not ignored forever. One way
of solving this problem might be to define specifications as sets of partially ordered
sets of events (i.e., runs) rather than sets of histories. One could then specify a
shared counter in such a way that a read operation is allowed to ignore an
increment only if it is concurrent with the read. The drawback of this approach is that
specifications no longer have the intuitive meaning of ensuring that the distributed
program behaves like a centralized server.

Appendix A

An Example: Token Passing

A.1  Formal Specification

We  want  to  implement  a  distributed  token  passing  algorithm.  The  client  interface 
consists  of  the  following  three  operations: 

•  QUERY(): BOOLEAN

-  returns TRUE if the caller is the current token holder.

•  PASS(x: CLIENTID): RETURNCODE

-  passes the token from the current token holder to client x.

This operation returns one of three values: OK, ERRORHOLDER (the caller is
not the current token holder), or ERRORREQUEST (client x did not request
the token).

•  REQUEST(): RETURNCODE

-  requests the token.

This operation returns one of three values: OK, ERRORHOLDER (the caller
is already holding the token), or ERRORREQUEST (the caller has already
requested the token).

We  use  the  following  abbreviated  notation  for  operations  and  return  values: 

Q  QUERY 

P  PASS 

R  REQUEST 

T  TRUE 

F  FALSE 

eH  ERRORHOLDER

eR  ERRORREQUEST 


Given a formal history H, we identify the current token holder, $\mathit{CurHold}(H)$, to be
the client that the token was last passed to, where client 1 is the initial holder of the
token:

\[
\mathit{CurHold}(H) =
\begin{cases}
1 & \text{if } H \text{ does not contain any successful PASS operations,} \\
x & \text{if the last successful PASS event in } H \text{ has the form } P_i(x){:}ok, \text{ for some } i.
\end{cases}
\]

We can further define a predicate that tells us whether there is a pending token
request by a particular client:

\[
\mathit{PendReq}(H, x) =
\begin{cases}
\text{TRUE} & \text{if } R_x{:}ok \in H \text{ and } H \text{ does not contain an event } P_i(x){:}ok \text{ after this request,} \\
\text{FALSE} & \text{otherwise.}
\end{cases}
\]

A formal specification S for our token passing example is given by the following
recursive definition:
1. $\emptyset \in S$

2. $\forall H \in S$: let $x = \mathit{CurHold}(H)$:

(i) $\forall y \neq x$: $H + Q_y{:}F \in S$,
and $H + Q_x{:}T \in S$.

(ii) $\forall j$: if $\mathit{PendReq}(H, j)$ then $H + P_x(j){:}ok \in S$,
if $\neg\mathit{PendReq}(H, j)$ then $H + P_x(j){:}eR \in S$.

(iii) $\forall y \neq x$: $\forall j$: $H + P_y(j){:}eH \in S$.

(iv) $\forall y \neq x$: if $\neg\mathit{PendReq}(H, y)$ then $H + R_y{:}ok \in S$,
if $\mathit{PendReq}(H, y)$ then $H + R_y{:}eR \in S$.

(v) $H + R_x{:}eH \in S$.

3. S is the smallest set satisfying the above.
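
The recursive definition can be read as a membership test for histories. The following Python sketch is one possible rendering, not part of the original formalism: the event encoding `(caller, op, arg, ret)` and the function names `cur_hold`, `pend_req`, `expected_return` and `is_legal` are assumptions made for illustration. It relies on the observation that, for a given history and operation, rule 2 permits exactly one return value.

```python
from typing import List, Optional, Tuple

# Assumed encoding: (caller, op, arg, ret), with op in {"Q", "P", "R"};
# arg is the target client for "P" and None otherwise.
Event = Tuple[int, str, Optional[int], str]


def cur_hold(history: List[Event]) -> int:
    """CurHold(H): target of the last successful PASS, or client 1 if there is none."""
    holder = 1
    for caller, op, arg, ret in history:
        if op == "P" and ret == "ok":
            holder = arg
    return holder


def pend_req(history: List[Event], x: int) -> bool:
    """PendReq(H, x): x has a successful REQUEST with no successful PASS to x after it."""
    pending = False
    for caller, op, arg, ret in history:
        if op == "R" and caller == x and ret == "ok":
            pending = True
        elif op == "P" and arg == x and ret == "ok":
            pending = False
    return pending


def expected_return(history: List[Event], caller: int, op: str,
                    arg: Optional[int] = None) -> str:
    """The single return value that rule 2 permits for this operation after H."""
    x = cur_hold(history)
    if op == "Q":                      # rule (i)
        return "T" if caller == x else "F"
    if op == "P":                      # rules (ii) and (iii)
        if caller != x:
            return "eH"
        return "ok" if pend_req(history, arg) else "eR"
    if op == "R":                      # rules (iv) and (v)
        if caller == x:
            return "eH"
        return "eR" if pend_req(history, caller) else "ok"
    raise ValueError(f"unknown operation {op!r}")


def is_legal(history: List[Event]) -> bool:
    """H is in S iff every event carries the return value permitted after its prefix."""
    return all(
        ret == expected_return(history[:n], caller, op, arg)
        for n, (caller, op, arg, ret) in enumerate(history)
    )


# Client 2 requests the token, client 1 passes it, client 2 queries.
example = [(2, "R", None, "ok"), (1, "P", 2, "ok"), (2, "Q", None, "T")]
print(is_legal(example))                 # True
print(is_legal([(1, "P", 2, "ok")]))     # False: no pending request by client 2
```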

The specification S says that QUERY should always return TRUE to the current
token holder, and that only the current holder is allowed to pass the token to any
other client. This specification describes an idealized token passing system, in the
sense that passing a token is supposed to be an instantaneous event: A PASS operation
takes effect immediately, because any operation following a PASS is required
to reflect the new token holder. Fortunately our definition of implementation
correctness only requires that the behavior of the system is indistinguishable from this
idealized behavior. To illustrate this point, consider a system with only two
processors. A PASS operation may be implemented by simply sending a message from
the previous to the new token holder. An external observer may see the following
history:

$H = (P_1(2),\; Q_2{:}F,\; Q_2{:}T)$

Obviously the pass message was delayed a little, so that only the second QUERY
operation reported client 2 as the current token holder. Although $H \notin S$ we still
consider the implementation correct, because, to the clients, H is indistinguishable
from the legal history

$H' = (Q_2{:}F,\; P_1(2),\; Q_2{:}T)$.
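
To make the comparison of the two histories concrete, the following Python sketch checks that they differ as sequences while every client sees the same sequence of its own events. Treating "indistinguishable to the clients" as equality of per-client projections is an assumption made here for illustration; the formal definition is the one given in the main text. The `(client, label)` encoding and the function names are likewise hypothetical.

```python
from typing import List, Tuple

# Assumed encoding: (client, label); labels are copied from the example above.
Event = Tuple[int, str]


def project(history: List[Event], client: int) -> List[str]:
    """Events of the history that were invoked by the given client, in order."""
    return [label for c, label in history if c == client]


def same_client_views(h1: List[Event], h2: List[Event]) -> bool:
    """True if every client observes the same sequence of its own events in both."""
    clients = {c for c, _ in h1} | {c for c, _ in h2}
    return all(project(h1, c) == project(h2, c) for c in clients)


observed = [(1, "P(2)"), (2, "Q:F"), (2, "Q:T")]   # history seen by the external observer
legal = [(2, "Q:F"), (1, "P(2)"), (2, "Q:T")]      # legal reordering

print(observed != legal)                    # True: different global orders
print(same_client_views(observed, legal))   # True: identical per-client views
```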

A.2 Commutativity and Ordering Constraints

Next we apply our theory from Chapter 5 to show that the token passing example
indeed has a CBCAST implementation. We start by computing a table of dependencies
(Table A.1) between events. There are two pairs of events which are completely
interchangeable and need not be considered separately:

$R_i{:}eH \equiv Q_i{:}T$, and
$P_i(x){:}eH \equiv Q_i{:}F$.

For this reason these events share the same row and column in Table A.1.

The table allows us to verify that the token passing specification is commutative.
There are two types of update events:

$R_i{:}ok$, and $P_i(j){:}ok$.

According to Definition 5.3 we have to show that

$\forall H$: $\forall a, b$ update events at different processors:

$Ha \in S \wedge Hb \in S \Rightarrow Hab \in S \wedge Hba \in S \wedge Hab \equiv Hba$.

Table A.1 shows that for any two such update events a and b, either the two events
commute (○ entry in the table), or there is no H such that Ha and Hb are both
legal (× in the table).
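
The commutativity condition can also be checked mechanically for a small system. The Python sketch below is an illustration under stated assumptions rather than the method used in this thesis: it fixes three clients, summarizes a history by the abstract state (current holder, set of pending requesters), and treats two histories as equivalent when they reach the same state. The names `step`, `reachable_states` and `commutative` are hypothetical.

```python
from itertools import product

CLIENTS = (1, 2, 3)  # assumed system size for the exhaustive check


def step(state, event):
    """Apply an update event to (holder, pending); return None if it is not legal."""
    holder, pending = state
    kind, caller, target = event
    if kind == "R":
        # R_caller:ok requires a non-holder without a pending request.
        if caller != holder and caller not in pending:
            return (holder, pending | {caller})
    elif kind == "P":
        # P_caller(target):ok requires the holder and a pending request by target.
        if caller == holder and target in pending:
            return (target, pending - {target})
    return None


UPDATES = [("R", i, None) for i in CLIENTS] + \
          [("P", i, j) for i, j in product(CLIENTS, CLIENTS)]


def reachable_states():
    """All states reachable from the initial state (client 1 holds, nothing pending)."""
    start = (1, frozenset())
    seen, todo = {start}, [start]
    while todo:
        s = todo.pop()
        for e in UPDATES:
            t = step(s, e)
            if t is not None and t not in seen:
                seen.add(t)
                todo.append(t)
    return seen


def commutative() -> bool:
    """Check the condition of Definition 5.3 on every reachable state."""
    for s in reachable_states():
        for a, b in product(UPDATES, UPDATES):
            if a[1] == b[1]:
                continue                      # only events at different processors
            sa, sb = step(s, a), step(s, b)
            if sa is None or sb is None:
                continue                      # premise "Ha and Hb legal" does not hold
            sab, sba = step(sa, b), step(sb, a)
            if sab is None or sba is None or sab != sba:
                return False
    return True


print(commutative())  # True
```

Only update events are explored because, in this model, the remaining events leave the abstract state unchanged.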
Table A.1: Dependencies between events in the token passing specification

                     Q_j:T     Q_j:F
                     R_j:eH    P_j(y):eH   R_j:eR        P_j(y):eR   R_j:ok         P_j(y):ok

Q_i:T, R_i:eH        ×         ○           ○             ×           ○              ×
Q_i:F, P_i(x):eH     ○         ○           ○             ○           ○              [illegible]
R_i:eR               ○         ○           ○             ○           ○              [illegible]
P_i(x):eR            ×         ○           × if x = j    ×           •→ if x = j    [illegible]
                                           ○ if x ≠ j                ○ if x ≠ j
R_i:ok               ○         ○           ○             ○           ○              [illegible]
P_i(x):ok            ×         ○           ○             ×           × if x = j     ×
                                                                     ○ if x ≠ j

○    The events commute, i.e., Hab ≡ Hba, for all H.

×    The events are incompatible, i.e., there does not exist any history
     H such that Ha and Hb would both be legal.

•→   There is an ordering constraint between the two events (Definition 5.4).
A.3 Mutually Exclusive Events


Next, we need to show that every plausible run has an acyclic closure. We exploit
the fact that certain types of events are mutually exclusive. These are all events
that indicate that their caller is currently holding the token.

Definition A.1

$B_i = \{Q_i{:}T,\; R_i{:}eH\} \cup \{P_i(x){:}ok \mid \text{for all } x\} \cup \{P_i(x){:}eR \mid \text{for all } x\}$

Lemma A.1

The set $B_i$ contains all events that indicate that processor i is holding the token
when the event occurs, i.e.,

$a \in B_i \wedge Ha \in S \Rightarrow \mathit{CurHold}(H) = i$.

Proof: Follows immediately from the token passing specification and the definition
of $B_i$. □

Lemma A.2

Let $B = \bigcup_i B_i$.

All events in B are mutually exclusive.

The  proof  of  a  restricted  form  of  this  lemma  was  already  presented  in  Section  5.2.3. 

Proof: Consider a plausible run R with two events $a \in B_i$ and $b \in B_j$. We
have to show that a and b cannot be concurrent. We do this by induction on the
number of events in R.

Base case: R contains no events other than a and b. Because R is plausible, $R[a] = a$
and $R[b] = b$ have legal linearizations $H_a = (a)$ and $H_b = (b)$ respectively. By
Lemma A.1, $H_a = (a) \in S$ and $a \in B_i$ imply that $i = \mathit{CurHold}(\emptyset) = 1$. For the
same reason $j = \mathit{CurHold}(\emptyset) = 1$. Therefore $a, b \in B_1$. Hence a and b cannot be
concurrent, because all events in $B_1$ correspond to operations invoked by the same
processor (processor 1).

For the induction step consider R with more than two events. Because R is plausible,
$R[a]$ and $R[b]$ have legal linearizations $H_a$ and $H_b$ respectively. Let $R' = R[a] \cap R[b]$.
By induction hypothesis $R[a]$, $R[b]$, as well as $R'$ do not contain any concurrent
events from the set B. Therefore we can define the following events:

$c = P_l(z){:}ok$    The last event in $R' \cap B$.

$a' = P_i(x){:}ok$   The first event after c in $R[a] \cap B$.

$b' = P_j(y){:}ok$   The first event after c in $R[b] \cap B$.

(where possibly, but not necessarily, $a' = a$ or $b' = b$). Note that $a' \parallel b'$ because
otherwise either $a'$ or $b'$ would be in $R'$. Then the histories $H_a$ and $H_b$ have the
form

$H_a = \ldots\; P_l(z){:}ok\; \ldots\; P_i(x){:}ok\; \ldots\; a$

$H_b = \ldots\; P_l(z){:}ok\; \ldots\; P_j(y){:}ok\; \ldots\; b$

with no pass events between c and $a'$ in $H_a$ and between c and $b'$ in $H_b$. Then $H_a$
can only be legal if $i = z$, otherwise the operation $P_i(x)$ should return an error code
eH. For the same reason $H_b$ is only legal if $j = z$. Hence $i = j$, i.e., $a'$ and $b'$ are
events at the same processor. But that contradicts $a' \parallel b'$. □
A.4 Acyclicity

We  already  proved  the  token  passing  specification  to  be  acyclic  in  Section  5.2.3  of 
Chapter  5.  For  the  sake  of  completeness  we  repeat  this  proof  here. 

Theorem  A.l 

The  token  passing  specification  is  acyclic. 

Proof: Assume not. Let R be a plausible run that contains a cycle. By Lemma 5.6
we may assume that the cycle only has concurrent segments. Consider the ordering
constraint edges ("•→") in such a cycle. The cycle cannot contain more than one
edge of type (III), otherwise there would be two pass events in different segments
of the cycle, which is not possible since segments are concurrent and pass events
are mutually exclusive. For the same reason there cannot be more than one edge of
type (I) or (II) in the cycle. By Lemma 5.5 the cycle has at least two "•→" edges;
hence it must have exactly one edge of type (III) and one of type (I) or (II). Hence
the cycle is of the following form:

C = P_j(i):ok → P_k(l):eR
      •→ R_l:ok → e
      •→ P_j(i):ok

where either $e = Q_i{:}F$ or $e = R_i{:}eR$.

The first segment of the cycle consists of the two pass events $a = P_j(i){:}ok$ and
$b = P_k(l){:}eR$. If R is plausible then $R[b]$ has a legal linearization $H_b$. Because $a \rightarrow b$,
a is in $R[b]$ and therefore also in $H_b$. Hence $H_b$ has the form

$H_b = \ldots\; P_j(i){:}ok\; \ldots\; P_k(l){:}eR$


Notice that the return value eR of the last event (the pass operation failed because
processor l did not request the token) indicates that processor k is holding the token
at that time. Therefore $H_b$ must contain a pass event $c = P_i(x){:}ok$ between the two
events a and b in $H_b$; otherwise processor i would still be holding the token at the
end of $H_b$. From Theorem 5.3 we know that c cannot be concurrent with a or b;
hence

$a \rightarrow c \rightarrow b$.

Now consider the event e in the second segment of the cycle. Events c and e cannot
be concurrent, because the operations were both invoked at processor i. If $c \rightarrow e$
we have $a \rightarrow c \rightarrow e$; hence $a \rightarrow e$. If $e \rightarrow c$ we have $e \rightarrow c \rightarrow b$; hence $e \rightarrow b$. In
both cases the two segments of the cycle would not be concurrent, contradicting
Lemma 5.6. □


Appendix  B 

Invocation-Completion Model

In  our  formal  specifications  we  consider  the  execution  of  an  operation  to  be  one 
single  event.  This  does  not  allow  us  to  model  operations  that  explicitly  wait  for 
another  client  to  take  some  action  (i.e.,  invoke  another  operation).  Such  wait- 
semantics  operations  can  be  modeled  if  we  treat  the  invocation  and  the  completion 
of  an  operation  as  two  separate  events. 

We argue that it is not necessary to do this: A specification with separate
invocation and completion events can be transformed into an equivalent
one-operation-one-event specification in which wait-semantics operations are
implemented by a busy wait. This works as follows:

Say specification S has an operation A with separate events for the invocation
and completion of A (INVOKE_A(...) and COMPLETE_A:v). We transform S into S'
by splitting A into two parts:

START_A() and QUERY_A().

We  then  make  the  following  two  modifications  to  S: 
1. Replace all invocation events INVOKE_A(...) by an event START_A(...):NIL.
Replace all completion events COMPLETE_A:v by an event QUERY_A():v.

2. Add additional histories to S that are obtained by inserting extra QUERY_A
events into existing histories: Insert an event QUERY_A():PENDING anywhere
between START_A(...):NIL and QUERY_A():v; insert an event QUERY_A():DONE
anywhere after QUERY_A():v but before the next START_A(...):NIL.
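
The two modifications can be sketched as operations on histories. The Python fragment below is a simplified illustration, assuming a single client, a single operation A, and events encoded as `(name, value)` pairs; the labels `"INVOKE_A"`, `"COMPLETE_A"`, `"START_A"`, `"QUERY_A"` and the functions `transform` and `insert_poll` are hypothetical names, and the operation's own return values are assumed to differ from the strings `"PENDING"` and `"DONE"`.

```python
from typing import List, Tuple

# Assumed encoding: (event name, return value); arguments of A are omitted.
Event = Tuple[str, object]


def transform(history: List[Event]) -> List[Event]:
    """Modification 1: rename the invocation and completion events of A."""
    out = []
    for name, value in history:
        if name == "INVOKE_A":
            out.append(("START_A", "NIL"))
        elif name == "COMPLETE_A":
            out.append(("QUERY_A", value))
        else:
            out.append((name, value))
    return out


def insert_poll(history: List[Event], position: int) -> List[Event]:
    """Modification 2: insert one extra QUERY_A event at the given index.

    The extra event returns PENDING between START_A and the completing
    QUERY_A, and DONE after completion but before the next START_A.
    """
    started = completed = False
    for name, value in history[:position]:
        if name == "START_A":
            started, completed = True, False
        elif name == "QUERY_A" and value not in ("PENDING", "DONE"):
            completed = True
    if not started:
        raise ValueError("no START_A before this position")
    poll = ("QUERY_A", "DONE" if completed else "PENDING")
    return history[:position] + [poll] + history[position:]


original = [("INVOKE_A", None), ("COMPLETE_A", 42)]
renamed = transform(original)
print(renamed)                  # [('START_A', 'NIL'), ('QUERY_A', 42)]
print(insert_poll(renamed, 1))  # [('START_A', 'NIL'), ('QUERY_A', 'PENDING'), ('QUERY_A', 42)]
print(insert_poll(renamed, 2))  # [('START_A', 'NIL'), ('QUERY_A', 42), ('QUERY_A', 'DONE')]
```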

Then the effect of a client invoking A in S is the same as the client invoking START_A
in S' and then doing a busy wait

while QUERY_A() = PENDING do nothing.

With this transformation, an implementation that satisfies the modified
specification S' will be equivalent to one that satisfies the original specification S.
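
As an illustration of the client side of this transformation, the following Python sketch wraps a START/QUERY pair in a busy wait so that the caller again sees a single waiting operation. The class `OperationA`, its parameter `polls_until_ready` (a stand-in for another client eventually taking the action that A waits for), and the wrapper `invoke_a` are hypothetical and serve only to make the loop executable.

```python
class OperationA:
    """Hypothetical one-operation-one-event interface for a waiting operation A."""

    def __init__(self, polls_until_ready: int, result):
        self._remaining = polls_until_ready  # polls answered with PENDING
        self._result = result
        self._state = "idle"

    def start_a(self):
        """Corresponds to the event START_A(...):NIL."""
        self._state = "running"
        return None

    def query_a(self):
        """Corresponds to QUERY_A(): returns PENDING, the result v, or DONE."""
        if self._state == "running":
            if self._remaining > 0:
                self._remaining -= 1
                return "PENDING"
            self._state = "done"
            return self._result          # the completing event QUERY_A():v
        if self._state == "done":
            return "DONE"
        raise RuntimeError("QUERY_A invoked before START_A")


def invoke_a(op: OperationA):
    """Client-side equivalent of invoking A: START_A followed by a busy wait."""
    op.start_a()
    while True:
        value = op.query_a()
        if value != "PENDING":
            return value


op = OperationA(polls_until_ready=3, result="v")
print(invoke_a(op))   # v
print(op.query_a())   # DONE
```

Unlike the one-line loop in the text, the wrapper keeps the value returned by the final query so that it can be handed back to the caller.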